Elasticsearch: Introducing the kNN query, an expert way to do kNN search

Authors: Mayya Sharipova and Benjamin Trent, Elastic

Current state of affairs: kNN search as a top-level section

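kNN search in Elasticsearch is organized as a top-level section of a search request. We designed it this way so that:

  • It can always return the global k nearest neighbors, regardless of the number of shards
  • These global k results are combined with the results of other queries to form a hybrid search
  • The global k results are passed to aggregations to form facets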

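Here is a simplified diagram of how kNN search is executed internally (some phases are omitted):

Figure 1: The steps of the top-level kNN search are:

  1. The user submits a search request
  2. The coordinator node sends the kNN search part of the request to the data nodes in the DFS phase
  3. Each data node runs the kNN search and sends its local top-k results back to the coordinator
  4. The coordinator merges all local results to form the global top k nearest neighbors
  5. The coordinator sends the global k nearest neighbors back to the data nodes, along with any additional queries
  6. Each data node runs the additional queries and sends its local size results back to the coordinator
  7. The coordinator merges all local results and sends the response to the user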

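We first run the kNN search in the DFS phase to obtain the global top k results. These global k results are then passed to the other parts of the search request, such as other queries or aggregations. Even though the execution looks complex, the model is simple from the user's perspective, because the user can always be sure that kNN search returns the global k results.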
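The merge performed by the coordinator can be pictured with a small Python sketch. This is only an illustration of the idea, not Elasticsearch's implementation: the function names `shard_top_k` and `coordinator_merge` and the (score, doc_id) pairs are hypothetical.

```python
import heapq


def shard_top_k(shard_hits, k):
    """Return the k best (score, doc_id) pairs found on one shard."""
    return heapq.nlargest(k, shard_hits)


def coordinator_merge(per_shard_hits, k):
    """Merge every shard's local top-k into the global top-k."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(k, merged)


# Hypothetical (score, doc_id) pairs held by two shards.
shard_a = [(0.9, "a1"), (0.8, "a2"), (0.1, "a3")]
shard_b = [(0.7, "b1"), (0.6, "b2"), (0.5, "b3")]

k = 3
local_results = [shard_top_k(shard, k) for shard in (shard_a, shard_b)]
global_top_k = coordinator_merge(local_results, k)
print(global_top_k)  # [(0.9, 'a1'), (0.8, 'a2'), (0.7, 'b1')]
```

Whatever the number of shards, exactly k neighbors leave this step, which is why later queries and aggregations always see the same global set.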

The request format looks like this:

GET collection-with-embeddings/_search
{"knn": {"field": "text_embedding.predicted_value","query_vector_builder": {"text_embedding": {"model_id": "sentence-transformers__msmarco-distilbert-base-tas-b","model_text": "How is the weather in Jamaica?"}},"k": 10,"num_candidates": 100},"_source": ["id","text"]
}

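Introducing the kNN query

Over time, we realized that kNN search also needs to be expressible as a query. Queries are a core component of a search request in Elasticsearch, and representing kNN search as a query makes it possible to combine it flexibly with other queries to address more complex requests.

Unlike the top-level kNN search, the kNN query has no k parameter. As with other queries, the number of results (nearest neighbors) returned is defined by the size parameter. As with kNN search, the num_candidates parameter defines how many candidates to consider on each shard while executing the kNN search.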

GET products/_search
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10}}
}

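The kNN query is executed differently from the top-level kNN search. Here is a simplified diagram of how the kNN query is executed internally (some phases are omitted):

Figure 2: The steps of the query-based kNN search are:

  • The user submits a search request
  • The coordinator sends the kNN search query, along with any additional queries, to the data nodes
  • Each data node runs the query and sends its local size results back to the coordinator node
  • The coordinator node merges all local results and sends the response to the user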

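We run a kNN search on each shard to get num_candidates results; these results are passed to the other queries and aggregations on that shard to produce the shard's size results. Because we do not collect the global k nearest neighbors first, in this model the number of nearest neighbors collected and visible to other queries and aggregations depends on the number of shards.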
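To make the dependence on shard count concrete, here is a rough counting sketch in Python. It is a simplification under stated assumptions: every shard is taken to contribute up to num_candidates neighbors, and the helper names and document counts are hypothetical.

```python
def neighbors_visible_knn_query(docs_per_shard, num_candidates):
    """kNN query: each shard contributes up to num_candidates neighbors."""
    return sum(min(num_candidates, docs) for docs in docs_per_shard)


def neighbors_visible_top_level(docs_per_shard, k):
    """Top-level kNN search: only the global k neighbors move on."""
    return min(k, sum(docs_per_shard))


# Hypothetical document counts per shard.
one_shard = [1000]
three_shards = [400, 300, 300]

print(neighbors_visible_knn_query(one_shard, num_candidates=10))     # 10
print(neighbors_visible_knn_query(three_shards, num_candidates=10))  # 30
print(neighbors_visible_top_level(three_shards, k=10))               # 10
```

With the same index split over three shards, aggregations under the kNN query would see up to three times as many neighbors, while the top-level search stays at k.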

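kNN query API examples

Let's look at API examples that demonstrate the differences between the top-level kNN search and the kNN query.

We create a products index and index some documents: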

PUT products
{"mappings": {"dynamic": "strict","properties": {"department": {"type": "keyword"},"brand": {"type": "keyword"},"description": {"type": "text"},"embedding": {"type": "dense_vector","index": true,"similarity": "l2_norm"},"price": {"type": "float"}}}
}
POST products/_bulk?refresh=true
{"index":{"_id":1}}
{"department":"women","brand": "Levi's", "description":"high-rise red jeans","embedding":[1,1,1,1],"price":100}
{"index":{"_id":2}}
{"department":"women","brand": "Calvin Klein","description":"high-rise beautiful jeans","embedding":[1,1,1,1],"price":250}
{"index":{"_id":3}}
{"department":"women","brand": "Gap","description":"every day jeans","embedding":[1,1,1,1],"price":50}
{"index":{"_id":4}}
{"department":"women","brand": "Levi's","description":"jeans","embedding":[2,2,2,0],"price":75}
{"index":{"_id":5}}
{"department":"women","brand": "Levi's","description":"luxury jeans","embedding":[2,2,2,0],"price":150}
{"index":{"_id":6}}
{"department":"men","brand": "Levi's", "description":"jeans","embedding":[2,2,2,0],"price":50}
{"index":{"_id":7}}
{"department":"women","brand": "Levi's", "description":"jeans 2023","embedding":[2,2,2,0],"price":150}

Like the top-level kNN search, the kNN query has a num_candidates parameter and an inner filter parameter that acts as a pre-filter.

GET products/_search?filter_path=**.hits
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}}
} 
{"hits": {"hits": [{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75}},{"_index": "products","_id": "5","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150}},{"_index": "products","_id": "7","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans 2023","embedding": [2,2,2,0],"price": 150}}]}
}
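The _score of 1 in this response follows from the l2_norm similarity configured in the mapping. The Elasticsearch documentation for dense_vector gives the score for l2_norm as 1 / (1 + l2_norm(query, vector)^2). A minimal Python sketch of that formula (the function name `l2_norm_score` is made up for illustration):

```python
def l2_norm_score(query, vector):
    """Score for l2_norm similarity: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1 / (1 + squared_distance)


query = [2, 2, 2, 0]
print(l2_norm_score(query, [2, 2, 2, 0]))  # 1.0 (documents 4, 5, 6, 7)
print(l2_norm_score(query, [1, 1, 1, 1]))  # 0.2 (documents 1, 2, 3)
```

Documents whose embedding equals the query vector score 1, and documents with the embedding [1,1,1,1] score 0.2.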

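For collapsing and aggregations, the kNN query can return more diverse results than the top-level kNN search. For the kNN query below, we run a kNN search on each shard to gather up to 10 nearest neighbors (num_candidates), which are then passed to collapse to produce the top 3 results. As a result, the response contains 3 diverse hits.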

GET products/_search?filter_path=**.hits
{"size" : 3,"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}},"collapse": {"field": "brand"}
}
{"hits": {"hits": [{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"fields": {"brand": ["Levi's"]}},{"_index": "products","_id": "2","_score": 0.2,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250},"fields": {"brand": ["Calvin Klein"]}},{"_index": "products","_id": "3","_score": 0.2,"_source": {"department": "women","brand": "Gap","description": "every day jeans","embedding": [1,1,1,1],"price": 50},"fields": {"brand": ["Gap"]}}]}
}

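The top-level kNN search, by contrast, first gathers the global top 3 results in the DFS phase and then passes them to collapse in the query phase. The response contains only 1 hit, because the 3 global nearest neighbors all happen to come from the same brand.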
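The difference can be reproduced with a small Python sketch over the six "women" documents indexed earlier. It only mimics the effect of collapsing and is not how Elasticsearch implements it; `collapse_by_brand` is a hypothetical helper, and ties are broken by input order here.

```python
# (doc_id, brand, score) for the documents with department "women",
# using the scores seen in the responses (1 and 0.2).
candidates = [
    ("4", "Levi's", 1.0), ("5", "Levi's", 1.0), ("7", "Levi's", 1.0),
    ("1", "Levi's", 0.2), ("2", "Calvin Klein", 0.2), ("3", "Gap", 0.2),
]


def collapse_by_brand(hits, size):
    """Keep the best-scoring hit per brand, then return the top `size` groups."""
    best = {}
    for doc_id, brand, score in sorted(hits, key=lambda hit: -hit[2]):
        best.setdefault(brand, (doc_id, brand, score))
    return list(best.values())[:size]


# kNN query: every candidate found on the shard reaches collapse.
knn_query_hits = collapse_by_brand(candidates, size=3)
# Top-level kNN search with k=3: only the 3 global nearest neighbors reach collapse.
top_level_hits = collapse_by_brand(candidates[:3], size=3)

print([hit[0] for hit in knn_query_hits])  # ['4', '2', '3']
print([hit[0] for hit in top_level_hits])  # ['4']
```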

The same holds for aggregations: the kNN query yields 3 distinct buckets, while the top-level kNN search yields only 1.

GET products/_search?filter_path=aggregations
{
"size": 0,
"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"filter" : {"term" : {"department" : "women"}}}},"aggs": {"brands": {"terms": {"field": "brand"}}}
}
{"aggregations": {"brands": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "Levi's","doc_count": 4},{"key": "Calvin Klein","doc_count": 1},{"key": "Gap","doc_count": 1}]}}
}

And here is the equivalent top-level kNN search:

GET products/_search?filter_path=aggregations
{"size": 0,"knn": {"field": "embedding","query_vector": [2,2,2,0],"k": 3,"num_candidates": 10,"filter": {"term": {"department": "women"}}},"aggs": {"brands": {"terms": {"field": "brand"}}}
}
{"aggregations": {"brands": {"doc_count_error_upper_bound": 0,"sum_other_doc_count": 0,"buckets": [{"key": "Levi's","doc_count": 3}]}}
}
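The two sets of buckets can be checked by hand with a short Python count. This is an illustrative sketch only: the list mirrors the "women" documents from the bulk request, ordered by nearness to the query vector, and `Counter` stands in for the terms aggregation.

```python
from collections import Counter

# Brands of the "women" documents, nearest neighbors first
# (documents 4, 5, 7, then 1, 2, 3).
brands_nearest_first = [
    "Levi's", "Levi's", "Levi's",    # embedding [2,2,2,0]
    "Levi's", "Calvin Klein", "Gap"  # embedding [1,1,1,1]
]

# kNN query: all candidates gathered on the shard feed the aggregation.
knn_query_buckets = Counter(brands_nearest_first)
# Top-level kNN search: only the global k=3 neighbors feed the aggregation.
top_level_buckets = Counter(brands_nearest_first[:3])

print(dict(knn_query_buckets))  # {"Levi's": 4, 'Calvin Klein': 1, 'Gap': 1}
print(dict(top_level_buckets))  # {"Levi's": 3}
```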

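Now let's look at other examples that show the flexibility of the kNN query, specifically how it can be combined with other queries.

A kNN query can be part of a boolean query (note that all external query filters are applied as post-filters for the kNN search). We can use the kNN query's _name parameter to enrich the results with extra information that tells whether the kNN query matched and what it contributed to the score.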

GET products/_search?include_named_queries_score
{"size": 3,"query": {"bool": {"should": [{"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}},{"match": {"description": {"query": "luxury","_name": "bm25query"}}}]}}
}
{"took": 2,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 2.8042283,"hits": [{"_index": "products","_id": "5","_score": 2.8042283,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": {"knn_query": 1,"bm25query": 1.8042282}},{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": {"knn_query": 1}},{"_index": "products","_id": "6","_score": 1,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": {"knn_query": 1}}]}
}
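The scores in this response add up the way a bool query's should clauses normally do: the scores of the matching clauses are summed. A tiny Python check using the matched_queries values from the response (the helper name `bool_should_score` is hypothetical):

```python
def bool_should_score(clause_scores):
    """Sum the scores of the `should` clauses that matched a document."""
    return sum(clause_scores.values())


# Per-clause scores copied from matched_queries in the response above.
doc_5 = {"knn_query": 1.0, "bm25query": 1.8042282}
doc_4 = {"knn_query": 1.0}

print(bool_should_score(doc_5))  # about 2.8042282; the response shows 2.8042283
print(bool_should_score(doc_4))  # 1.0
```

The last digit differs slightly from the response, presumably because Elasticsearch computes scores in single precision.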

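A kNN query can also be part of more complex queries, such as the pinned query. This is useful when we want to display the nearest results but also promote a selected number of other results.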

GET products/_search
{"size": 3,"query": {"pinned": {"ids": [ "1", "2" ],"organic": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}}}}
}
{"took": 9,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 1.7014124e+38,"hits": [{"_index": "products","_id": "1","_score": 1.7014124e+38,"_source": {"department": "women","brand": "Levi's","description": "high-rise red jeans","embedding": [1,1,1,1],"price": 100},"matched_queries": ["knn_query"]},{"_index": "products","_id": "2","_score": 1.7014122e+38,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250},"matched_queries": ["knn_query"]},{"_index": "products","_id": "4","_score": 1,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]}]}
}

We can even make the kNN query part of a function_score query. This is useful when we need to define custom scores for the results returned by the kNN query:

GET products/_search
{"size": 3,"query": {"function_score": {"query": {"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 10,"_name": "knn_query"}},"functions": [{"filter": { "match": { "department": "men" } },"weight": 100},{"filter": { "match": { "department": "women" } },"weight": 50}]}}
}
{"took": 3,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 100,"hits": [{"_index": "products","_id": "6","_score": 100,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": ["knn_query"]},{"_index": "products","_id": "4","_score": 50,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]},{"_index": "products","_id": "5","_score": 50,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": ["knn_query"]}]}
}
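The scores of 100 and 50 can be reproduced with a short Python sketch. It assumes the documented function_score defaults (score_mode and boost_mode both multiply), since the request sets neither; the helper `function_score` and its tuple format are made up for illustration.

```python
def function_score(query_score, doc, functions):
    """Multiply the query score by the weight of every function whose filter matches.

    Assumes score_mode=multiply and boost_mode=multiply (the defaults).
    """
    factor = 1.0
    for field, value, weight in functions:
        if doc.get(field) == value:
            factor *= weight
    return query_score * factor


# (field, value, weight) triples mirroring the two functions in the request.
functions = [("department", "men", 100), ("department", "women", 50)]

print(function_score(1.0, {"department": "men"}, functions))    # 100.0 (document 6)
print(function_score(1.0, {"department": "women"}, functions))  # 50.0 (documents 4, 5)
```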

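A kNN query as part of a dis_max query is useful when we want to combine the results of a kNN search with other queries, so that a document's score comes from its highest-ranking clause, with a tie-breaking increment for any additional clauses.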

GET products/_search
{"size": 5,"query": {"dis_max": {"queries": [{"knn": {"field": "embedding","query_vector": [2,2,2,0],"num_candidates": 3,"_name": "knn_query"}},{"match": {"description": "high-rise jeans"}}],"tie_breaker": 0.8}}
}
{"took": 1,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 7,"relation": "eq"},"max_score": 1.890432,"hits": [{"_index": "products","_id": "1","_score": 1.890432,"_source": {"department": "women","brand": "Levi's","description": "high-rise red jeans","embedding": [1,1,1,1],"price": 100}},{"_index": "products","_id": "2","_score": 1.890432,"_source": {"department": "women","brand": "Calvin Klein","description": "high-rise beautiful jeans","embedding": [1,1,1,1],"price": 250}},{"_index": "products","_id": "4","_score": 1.0679927,"_source": {"department": "women","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 75},"matched_queries": ["knn_query"]},{"_index": "products","_id": "6","_score": 1.0679927,"_source": {"department": "men","brand": "Levi's","description": "jeans","embedding": [2,2,2,0],"price": 50},"matched_queries": ["knn_query"]},{"_index": "products","_id": "5","_score": 1.0556482,"_source": {"department": "women","brand": "Levi's","description": "luxury jeans","embedding": [2,2,2,0],"price": 150},"matched_queries": ["knn_query"]}]}
}
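The dis_max scoring rule is documented as the best clause score plus tie_breaker times the sum of the other matching clause scores. A minimal Python sketch of that rule follows; the function name `dis_max_score` and the clause scores are illustrative values, not numbers taken from the response.

```python
def dis_max_score(clause_scores, tie_breaker):
    """Best clause score plus tie_breaker times the sum of the remaining ones."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Hypothetical document matched by the kNN clause (score 1.0)
# and by the match clause (score 0.25).
print(dis_max_score([1.0, 0.25], tie_breaker=0.8))

# Hypothetical document matched by a single clause only: the tie breaker has no effect.
print(dis_max_score([1.5], tie_breaker=0.8))
```

This is consistent with the response above: documents 4, 5 and 6 match the kNN clause with score 1 and pick up a small increment from the match clause, while documents 1 and 2 carry no knn_query entry in matched_queries.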

The kNN query was introduced in version 8.12. Please try it out; we would appreciate any feedback.

Original article: Introducing kNN query, an expert way to do kNN search (Elastic Search Labs)
