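Tencent Cloud VectorDB has the Embedding feature enabled by default. To apply Embedding when writing, updating, and searching data, you must specify an Embedding model and configure the related parameters when you create the Collection.
Specify an Embedding model when creating a Collection
The following request uses /collection/create to create the Collection book-emb. In the embedding parameter, field names the field that holds the text (text here), vectorField names the field that stores the vector generated from that text, and model specifies the Embedding model. For more information, see /collection/create.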
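Note:
The following example can be copied directly. Before running it on a CVM instance, use a text editor to replace api_key=A5VOgsMpGWJhUI0WmUbY******************** and 10.0.X.X with your actual values.
When configuring the Embedding parameters, if you do not set dimension in indexes, it is automatically set to the vector dimension of the Embedding model. If the dimension you set does not match the vector dimension of the Embedding model, an error is returned. For the vector dimension of each Embedding model, see Model Information.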
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/collection/create \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "replicaNum": 2,
    "shardNum": 1,
    "description": "this is the collection description",
    "embedding": {"field": "text", "vectorField": "vector", "model": "bge-base-zh"},
    "indexes": [
      {"fieldName": "id", "fieldType": "string", "indexType": "primaryKey"},
      {"fieldName": "vector", "fieldType": "vector", "indexType": "HNSW", "metricType": "COSINE", "params": {"M": 16, "efConstruction": 200}},
      {"fieldName": "bookName", "fieldType": "string", "indexType": "filter"},
      {"fieldName": "author", "fieldType": "string", "indexType": "filter"}
    ]
  }'
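Write text data
Use /document/upsert to batch-insert data into the Collection book-emb in the database db-test. In the following example, the original text is passed in the text field. This is the text field name that field in the embedding parameter specified when the Collection was created (text in this example).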
Note:
If you are not sure whether the Collection is configured with an Embedding model, call /collection/describe before writing data and check whether status in the embedding parameter is enabled.
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/upsert \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "buildIndex": true,
    "documents": [
      {"id": "0001", "text": "话说天下大势,分久必合,合久必分。", "author": "罗贯中", "bookName": "三国演义", "page": 21},
      {"id": "0002", "text": "混沌未分天地乱,茫茫渺渺无人间。", "author": "吴承恩", "bookName": "西游记", "page": 22},
      {"id": "0003", "text": "甄士隐梦幻识通灵,贾雨村风尘怀闺秀。", "author": "曹雪芹", "bookName": "红楼梦", "page": 23}
    ]
  }'
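The check on /collection/describe mentioned in the note above could be scripted roughly as follows. This is only a sketch: the request body (just database and collection) is an assumption, since this page does not show it, and the names VDB_URL, VDB_KEY, describe_payload, and describe_collection are hypothetical helpers introduced for illustration.

```shell
# Hypothetical placeholders: replace with your instance address and API key.
VDB_URL="${VDB_URL:-http://10.0.X.X:80}"
VDB_KEY="${VDB_KEY:-A5VOgsMpGWJhUI0WmUbY********************}"

# Build the JSON body for /collection/describe.
# Assumption: the endpoint only needs the database and collection names.
describe_payload() {
  printf '{"database": "%s", "collection": "%s"}' "$1" "$2"
}

# Send the request and print the raw response.
describe_collection() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer account=root&api_key=${VDB_KEY}" \
    "${VDB_URL}/collection/describe" \
    -d "$(describe_payload "$1" "$2")"
}
```

Running describe_collection db-test book-emb and looking at the status value under embedding in the response shows whether Embedding is enabled for the Collection.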
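Search data
The following example uses the /document/search interface to search the Collection book-emb for documents that are similar to the text in the embeddingItems parameter and that satisfy the Filter expression bookName in ("三国演义","西游记"). ef is a parameter of the HNSW index type that specifies how many neighbor nodes are traversed during the search. It defaults to 200. A larger ef gives a higher recall rate, typically at the cost of slower queries.
outputFields specifies which fields to return.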
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/search \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "search": {
      "embeddingItems": ["天下大势,分久必合,合久必分"],
      "limit": 3,
      "params": {"ef": 200},
      "retrieveVector": false,
      "filter": "bookName in (\"三国演义\",\"西游记\")",
      "outputFields": ["id", "author", "text", "bookName"]
    }
  }'
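The search results are shown below. score is the similarity score, computed with COSINE; the larger the value, the more similar the document. text is the field name defined for the written text when the Collection was created, and it stores the original text.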
{
  "code": 0,
  "msg": "operation success",
  "documents": [
    [
      {"id": "0001", "score": 0.9792741537094116, "bookName": "三国演义", "author": "罗贯中", "text": "话说天下大势,分久必合,合久必分。"},
      {"id": "0002", "score": 0.7909858226776123, "bookName": "西游记", "author": "吴承恩", "text": "混沌未分天地乱,茫茫渺渺无人间。"}
    ]
  ]
}
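The Filter expression in the search request contains double quotes, which have to be escaped once it is placed inside the JSON body. One way to avoid escaping by hand is to build the expression first and escape it in a separate step. The helpers in_filter, json_escape, and search_payload below are hypothetical names introduced for this sketch; the payload fields are the ones used in the search request above.

```shell
# Build a Filter expression of the form: field in ("v1","v2",...)
in_filter() {
  field="$1"; shift
  list=""
  for v in "$@"; do
    # Append each value wrapped in double quotes, comma-separated.
    list="${list:+$list,}\"$v\""
  done
  printf '%s in (%s)' "$field" "$list"
}

# Escape backslashes and double quotes so that the string can be
# placed inside a JSON string literal.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Assemble a /document/search body from a query text and a Filter expression.
search_payload() {
  printf '{"database": "db-test", "collection": "book-emb", "search": {"embeddingItems": ["%s"], "limit": 3, "params": {"ef": 200}, "retrieveVector": false, "filter": "%s", "outputFields": ["id", "author", "text", "bookName"]}}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

For example, search_payload "天下大势,分久必合,合久必分" "$(in_filter bookName 三国演义 西游记)" prints a body that can be passed to curl with -d in place of the hand-written JSON.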
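Update data
Use the /document/update interface to update data. The following example selects Documents by documentIds together with a filter expression, updates the text in their text field, sets page to 30, and adds a new field test_new_field. The vector data in the vector field is updated automatically.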
curl -i -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer account=root&api_key=A5VOgsMpGWJhUI0WmUbY********************' \
  http://10.0.X.X:80/document/update \
  -d '{
    "database": "db-test",
    "collection": "book-emb",
    "query": {
      "documentIds": ["0001", "0003"],
      "filter": "bookName in (\"三国演义\",\"西游记\")"
    },
    "update": {
      "text": "合久必分,分久必合",
      "page": 30,
      "test_new_field": "new field value"
    }
  }'
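After successful execution, the following is returned. Of the two IDs listed in documentIds, only 0001 also satisfies the filter expression (0003 belongs to 红楼梦), which is consistent with affectedCount being 1.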
{"code": 0,"msg": "operation success","affectedCount": 1}
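Query the Document with ID 0001 through /document/query to confirm that the updated fields took effect. The response below shows that the text and page fields have been updated and that the new field test_new_field has been added.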
{
  "code": 0,
  "msg": "operation success",
  "count": 1,
  "documents": [
    {"id": "0001", "author": "罗贯中", "test_new_field": "new field value", "bookName": "三国演义", "page": 30, "text": "合久必分,分久必合"}
  ]
}
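The /document/query request that produces this response is not shown on this page. A sketch of what it might look like follows. The body shape (query.documentIds and retrieveVector) is an assumption modeled on the other requests here, and VDB_URL, VDB_KEY, query_payload, and query_document are hypothetical names.

```shell
# Hypothetical placeholders: replace with your instance address and API key.
VDB_URL="${VDB_URL:-http://10.0.X.X:80}"
VDB_KEY="${VDB_KEY:-A5VOgsMpGWJhUI0WmUbY********************}"

# Build the JSON body for /document/query for a single Document ID.
# Assumption: the endpoint accepts query.documentIds like /document/update does.
query_payload() {
  printf '{"database": "db-test", "collection": "book-emb", "query": {"documentIds": ["%s"], "retrieveVector": false}}' "$1"
}

# Send the request and print the raw response.
query_document() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer account=root&api_key=${VDB_KEY}" \
    "${VDB_URL}/document/query" \
    -d "$(query_payload "$1")"
}
```

Calling query_document 0001 after the update should return the Document with the new text, page, and test_new_field values if the request shape assumed here matches the actual interface.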