Sentence Transformers专注于句子和文本嵌入,支持超过100种语言。利用深度学习技术,特别是Transformer架构的优势,将文本转换为高维向量空间中的点,使得相似的文本在几何意义上更接近。
💥pip安装:
pip install -U sentence-transformers
💥conda安装:
conda install -c conda-forge sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# 加载all-MiniLM-L6-v2,这是一个在超过 10 亿个训练对的大型数据集上微调的 MiniLM 模型
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# 计算所有句子对之间的相似度
similarities = model.similarity(embeddings, embeddings)
print(similarities)
输出:
💯Cross Encoder (又名 reranker) 模型的用法与 Sentence Transformers 类似:
from sentence_transformers.cross_encoder import CrossEncoder
# 我们选择要加载的CrossEncoder模型
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
# 定义查询句子和语料库
query = "A man is eating pasta."
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"A cheetah is running behind its prey.",
]
# 对句子进行排名
ranks = model.rank(query, corpus)
print("Query: ", query)
for rank in ranks:
print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
import numpy as np
# 使用 NumPy 进行排序
sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
输出:
💫对于语义文本相似度 (STS),我们希望为所有相关文本生成嵌入并计算它们之间的相似度。相似度得分最高的文本对在语义上最相似
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences1 = [
"The new movie is awesome",
"The cat sits outside",
"A man is playing guitar",
]
sentences2 = [
"The dog plays in the garden",
"The new movie is so great",
"A woman watches TV",
]
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)
similarities = model.similarity(embeddings1, embeddings2)
for idx_i, sentence1 in enumerate(sentences1):
print(sentence1)
for idx_j, sentence2 in enumerate(sentences2):
print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
💫输出:
可以通过多种方式改变此值:
1. 通过使用所需的相似度函数初始化 SentenceTransformer 实例:
from sentence_transformers import SentenceTransformer, SimilarityFunction
model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.DOT_PRODUCT)
2. 通过直接在 SentenceTransformer 实例上设置值:
from sentence_transformers import SentenceTransformer, SimilarityFunction
model = SentenceTransformer("all-MiniLM-L6-v2")
model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT
Sentence Transformers 实现了两种方法来计算嵌入之间的相似度
from sentence_transformers import SentenceTransformer, SimilarityFunction
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
model.similarity_fn_name = SimilarityFunction.MANHATTAN
print(model.similarity_fn_name)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
输出:
🧊语义搜索旨在通过理解搜索查询的语义含义和要搜索的语料库来提高搜索准确率。与只能根据词汇匹配查找文档的关键字搜索引擎不同,语义搜索在给定同义词、缩写和拼写错误的情况下也能表现良好。
语义搜索背后的理念是将语料库中的所有条目(无论是句子、段落还是文档)嵌入到向量空间中。在搜索时,查询被嵌入到相同的向量空间中,并从语料库中找到最接近的嵌入。这些条目应该与查询具有较高的语义相似度。
🧊我们设置的一个关键区别是对称与非对称语义搜索:
对于小型语料库(最多约 100 万个条目),我们可以通过手动实现语义搜索,即计算语料库和查询的嵌入
import torch
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"A cheetah is running behind its prey.",
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
queries = [
"A man is eating pasta.",
"Someone in a gorilla costume is playing a set of drums.",
"A cheetah chases prey on across a field.",
]
top_k = min(5, len(corpus))
for query in queries:
query_embedding = embedder.encode(query, convert_to_tensor=True)
similarity_scores = embedder.similarity(query_embedding, corpus_embeddings)[0]
scores, indices = torch.topk(similarity_scores, k=top_k)
print("\nQuery:", query)
print("Top 5 most similar sentences in corpus:")
for score, idx in zip(scores, indices):
print(corpus[idx], f"(Score: {score:.4f})")
输出:
🍹我们也可以不用自己实现语义搜索 ,可以使用util.semantic_search函数
sentence_transformers.util.semantic_search(query_embeddings: Tensor, corpus_embeddings: Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: Callable[[Tensor, Tensor], Tensor] = <function cos_sim>) →列表[列表[字典[ str , int | float ] ] ]
corpus_embeddings = corpus_embeddings.to("cuda")
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
query_embeddings = query_embeddings.to("cuda")
query_embeddings = util.normalize_embeddings(query_embeddings)
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score)
对于复杂的搜索任务,例如问答检索,使用检索和重新排名可以显著提高搜索质量。
给定一个搜索查询,我们首先使用一个检索系统来检索一个大列表,例如 100 个可能与该查询相关的结果。对于检索,我们可以使用词汇搜索,例如使用 Elasticsearch 之类的矢量引擎,或者我们可以使用双编码器进行密集检索。但是,检索系统可能会检索与搜索查询不太相关的文档。
🌤️双编码器会为段落和搜索查询独立生成嵌入
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
docs = [
"My first paragraph. That contains information",
"Python is a programming language.",
]
document_embeddings = model.encode(docs)
print('document_embeddings:',document_embeddings)
query = "What is Python?"
query_embedding = model.encode(query)
print('query_embedding:',query_embedding)
输出: