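Elasticsearch (ES) lets a single field hold an array of vectors through the nested field type, and the number of vectors can differ from document to document. This is useful in practice, so let's walk through a video retrieval example.
Example background
Suppose we need to run vector search over videos. The challenge is that each video yields a different number of images after frame extraction. A nested field handles this case.
For example, a video record has the fields id, title, content, and image. The image field stores the vectors of the images extracted from the video's frames. Each video has n images, and n is not fixed.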
Create the index
When creating the mappings, declare image as a nested field:
// id is a keyword field; title and content are text fields
// image.num is the image number, e.g. which frame it came from (optional)
// image.emb is the embedding of the image; image holds [image1_emb, image2_emb, image3_emb ... imagen_emb]
PUT /image_embeddings
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "title": { "type": "text" },
      "content": { "type": "text" },
      "image": {
        "type": "nested",
        "properties": {
          "num": { "type": "keyword" },
          "emb": {
            "type": "dense_vector",
            "dims": 5,
            "index_options": { "type": "int8_hnsw" },
            "similarity": "cosine"
          }
        }
      }
    }
  }
}
Write the data
When indexing, wrap the vector groups in an array, one object per frame, in the following format:
POST /image_embeddings/_doc/1
{
  "id": "book_001",
  "title": "晴天",
  "content": "刮风这天,我试着握着你手",
  "image": [
    { "num": "0", "emb": [0.1, 0.2, 0.3, 0.4, 0.5] },
    { "num": "1", "emb": [0.6, 0.7, 0.8, 0.9, 1.0] },
    { "num": "2", "emb": [0.2, 0.3, 0.4, 0.5, 0.6] }
  ]
}

POST /image_embeddings/_doc/2
{
  "id": "book_002",
  "title": "一路向北",
  "content": "一路向北,我试着握着你手",
  "image": [
    { "num": "0", "emb": [0.1, 0.2, 0.3, 0.4, 0.5] },
    { "num": "1", "emb": [0.6, 0.7, 0.8, 0.9, 1.0] }
  ]
}

POST /image_embeddings/_doc/3
{
  "id": "book_003",
  "title": "双截棍",
  "content": "哼哼哈嘿",
  "image": [
    { "num": "0", "emb": [0.1, 0.2, 0.3, 0.4, 0.5] },
    { "num": "1", "emb": [0.6, 0.7, 0.8, 0.9, 1.0] },
    { "num": "2", "emb": [0.1, 0.2, 0.3, 0.4, 0.5] },
    { "num": "3", "emb": [0.6, 0.7, 0.8, 0.9, 1.0] },
    { "num": "4", "emb": [0.1, 0.2, 0.3, 0.4, 0.5] },
    { "num": "5", "emb": [0.6, 0.7, 0.8, 0.9, 1.0] }
  ]
}
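Because the number of frames varies per video, it can help to build the document body programmatically. The following is a minimal sketch in plain Python, not part of the original example: the helper name `build_video_doc` and the constant `DIMS` are hypothetical, and the checks assume the mapping shown earlier (5 dimensions, cosine similarity, which Elasticsearch does not allow for zero-magnitude vectors). It only produces the JSON body and leaves sending it to ES to whichever client you use.

```python
import math

DIMS = 5  # assumption: must match "dims" in the image.emb mapping


def build_video_doc(video_id, title, content, frame_vectors):
    """Build a document body whose image field holds one nested object per frame."""
    image = []
    for num, emb in enumerate(frame_vectors):
        if len(emb) != DIMS:
            raise ValueError(f"frame {num}: expected {DIMS} dims, got {len(emb)}")
        # Cosine similarity is undefined for a zero vector, so reject it early.
        if math.sqrt(sum(x * x for x in emb)) == 0.0:
            raise ValueError(f"frame {num}: zero-magnitude vector")
        # num is mapped as keyword, so it is stored as a string.
        image.append({"num": str(num), "emb": [float(x) for x in emb]})
    return {"id": video_id, "title": title, "content": content, "image": image}


# Usage: a video with two extracted frames, mirroring document 2 above.
doc = build_video_doc(
    "book_002",
    "一路向北",
    "一路向北,我试着握着你手",
    [[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9, 1.0]],
)
```

The same helper works for any number of frames, since the nested array simply grows with the input list.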
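Run the vector search
The query is written the same way as the hybrid retrieval described earlier. For a nested field, each document's final score is the maximum score among its nested vectors.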
GET image_embeddings/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "retriever": {
            "knn": {
              "field": "image.emb",
              "query_vector": [0.1, 0.2, 0.3, 0.4, 0.5],
              "k": 5,
              "num_candidates": 50
            }
          },
          "weight": 0.8
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "match": { "title": "晴天" }
              }
            }
          },
          "weight": 0.2
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 20
    }
  }
}
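To make the "take the max" rule concrete, here is a small Python sketch that mimics how a document's kNN score can be derived from its nested vectors. It is an illustration under assumptions, not ES's implementation: the function names are hypothetical, the formula `(1 + cosine) / 2` is the score Elasticsearch documents for cosine similarity, and real scores may differ slightly because `int8_hnsw` quantizes vectors and the search is approximate.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(query, vector):
    """Score for cosine similarity, using the documented (1 + cosine) / 2 form."""
    return (1.0 + cosine(query, vector)) / 2.0


def nested_doc_score(query, frames):
    """A document is scored by its best-matching nested vector (max)."""
    return max(knn_score(query, frame["emb"]) for frame in frames)


# Frames of documents 1 and 2 from the indexing step above.
docs = {
    "1": [
        {"num": "0", "emb": [0.1, 0.2, 0.3, 0.4, 0.5]},
        {"num": "1", "emb": [0.6, 0.7, 0.8, 0.9, 1.0]},
        {"num": "2", "emb": [0.2, 0.3, 0.4, 0.5, 0.6]},
    ],
    "2": [
        {"num": "0", "emb": [0.1, 0.2, 0.3, 0.4, 0.5]},
        {"num": "1", "emb": [0.6, 0.7, 0.8, 0.9, 1.0]},
    ],
}
query = [0.1, 0.2, 0.3, 0.4, 0.5]
scores = {doc_id: nested_doc_score(query, frames) for doc_id, frames in docs.items()}
```

Both documents contain a frame identical to the query vector, so their best frame dominates the score regardless of how many other frames they hold. This is why a video with many frames is not penalized for the frames that do not match.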