ES in Practice: Programming Notes

These notes summarize some tips picked up while using ES for a knowledge base.

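1 Quick Start

In ES, requests are written in a JSON-like form, and keywords select the query or operation to run.

1.1 Creating an index

The request below creates an index with 5 primary shards and 1 replica of each. ES is schemaless by default, but specifying a mapping gives the index a schema (mapping usage is covered later).

# Create a vector index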
PUT test
{"settings": {"number_of_shards": 5,"number_of_replicas": 1,"index.codec": "proxima","index.vector.algorithm": "hnsw"},"mappings": {"properties": {"id": {"type": "text"},"gmt_create": {"type": "text"},"gmt_modified": {"type": "text"},"title": {"type": "text"},"question_id": {"type": "text"},"category_id": {"type": "text"},"bu_id": {"type": "text"},"bu_platform": {"type": "text"},"product": {"type": "text"},"platform": {"type": "text"},"status": {"type": "text"},"creator": {"type": "text"},"modifier": {"type": "text"},"knowledge_id": {"type": "text"},"space_id": {"type": "text"},"ext_info": {"type": "text"},"lan": {"type": "text"},"default_lan": {"type": "text"},"content_type": {"type": "text"},"content": {"type": "text"},"section_type": {"type": "text"},"terminal_type": {"type": "text"},"simQuestions": {"type": "text"},"recommand": {"type": "text"},"qq_vects": {"type": "proxima_vector","dim": 128,"vector_type": "float","distance_method": "SquaredEuclidean"}}}
}

1.2 Deleting an index

DELETE test

1.3 Querying

GET test/_search
{"query": {"match": {"product":"ding"}}
}

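2 Using mappings

2.1 The text type

ES is built on a search engine, so text fields are analyzed (tokenized) and stored in an inverted index. The advantage of this type is fast queries (within 50 ms); the drawback is that it does not support sorting by default, because the field is indexed as separate tokens rather than as one whole value.

2.2 keyword

A keyword field is not analyzed, but it is still indexed. Use it for exact matching, or when sorting or aggregations are needed.

For example, to sort the index created above by gmt_create and gmt_modified, define the mapping as follows. The keyword sub-field is then addressed as gmt_create.row or gmt_modified.row.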


PUT test
{"settings": {"number_of_shards": 5,"number_of_replicas": 1,"index.codec": "proxima","index.vector.algorithm": "hnsw"},"mappings": {"properties": {"id": {"type": "text"},"gmt_create": {"type":"text","fields":{"row":{"type":"keyword"}},"fielddata":true},"gmt_modified": {"type":"text","fields":{"row":{"type":"keyword"}},"fielddata":true},"title": {"type": "text"},"question_id": {"type": "text"},"category_id": {"type": "text"},"bu_id": {"type": "text"},"bu_platform": {"type": "text"},"product": {"type": "text"},"platform": {"type": "text"},"status": {"type": "text"},"creator": {"type": "text"},"modifier": {"type": "text"},"knowledge_id": {"type": "text"},"space_id": {"type": "text"},"ext_info": {"type": "text"},"lan": {"type": "text"},"default_lan": {"type": "text"},"content_type": {"type": "text"},"content": {"type": "text"},"section_type": {"type": "text"},"terminal_type": {"type": "text"},"simQuestions": {"type": "text"},"recommand": {"type": "text"},"qq_vects": {"type": "proxima_vector","dim": 128,"vector_type": "float","distance_method": "SquaredEuclidean"}}}
}
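As a rough illustration of how the keyword sub-field is used for sorting, the Python sketch below only builds request bodies as dictionaries and does not contact a cluster. The helper names text_with_keyword and sort_clause are hypothetical; the sub-field name row is taken from the mapping above. Sorting on the keyword sub-field does not itself need fielddata, so the sketch leaves that setting out.

```python
import json


def text_with_keyword(subfield="row"):
    """Return a text field mapping that carries a keyword sub-field."""
    return {"type": "text", "fields": {subfield: {"type": "keyword"}}}


def sort_clause(field, order="desc", subfield="row"):
    """Sort on the keyword sub-field, addressed as <field>.<subfield>."""
    return {f"{field}.{subfield}": {"order": order}}


# Mapping fragment for the two timestamp fields.
properties = {name: text_with_keyword() for name in ("gmt_create", "gmt_modified")}

# Search body that sorts by the un-analyzed value of gmt_create.
search_body = {"query": {"match_all": {}}, "sort": [sort_clause("gmt_create")]}

print(json.dumps(search_body))
```

The printed body can be sent to GET test/_search as-is.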

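Other types are not covered here; see the official documentation.

3 Tips

3.1 Data synchronization

DataX can be used to build real-time or offline synchronization into ES, provided that the ES mapping can be mapped to the data source.

If the data needs processing on the way, use the pipeline: data source -> DataHub -> Flink -> ES.

3.2 Changing a mapping

Although ES is schemaless, the type of an existing field cannot be changed once the index mapping has been created. Changing a field directly from long to string (text), or switching it to keyword for sorting, is therefore not possible.

There are two ways to change a mapping:

The by-the-book approach: add a new field

Add a new field to the mapping and deprecate the original one. Fields cannot be renamed, however, so this costs a lot of time coordinating with the front-end and back-end teams.

The guerrilla approach: use an alias

An alias can be thought of as an extra logical layer. Suppose index A (a physical ES index) sits behind the alias cco_dw. Create a new index B (another physical index) with the correct mapping, copy the data from index A into it, and then repoint the alias cco_dw to index B. In other words, A -> cco_dw becomes B -> cco_dw, while the back end keeps using cco_dw throughout.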

POST /_aliases
{"actions" : [{ "remove" : { "index" : "A", "alias" : "cco_dw" } },{ "add" : { "index" : "B", "alias" : "cco_dw" } }]
}
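When the swap is scripted, the actions body can be generated instead of written by hand. The Python sketch below is one possible way to do that; alias_swap_actions is a hypothetical helper name, and the sketch only builds the body without sending it. Putting remove and add in the same request matters, because the aliases API applies all actions of one request atomically, so the alias never points at nothing.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build an aliases request body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("cco_dw", "A", "B")
print(json.dumps(body))
```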

To copy the data across, use reindex:

POST _reindex
{"max_docs": 10000,"source": {"index": "test_order"},"dest": {"index": "test"}
}

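PS: if data is being written in real time, part of the stream may be lost during the switch. Therefore:

Make the change during a low-traffic period.
During the change, rewind the real-time job's offset so that it replays the affected window and no data is lost.

3.3 Delete by query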

POST test/_delete_by_query  
{"query":{"match":{"product":"ding"}}
}

4 Simple queries

4.1 Match all (match_all)

match_all keyword: returns all documents in the index.

GET /ems/_search
{"query": { "match_all": {} }
}

4.2 Limiting the number of results (size)

size keyword: sets the number of hits returned by the query.

GET /ems/_search
{"query": { "match_all": {} },"size": 1
}

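4.3 Pagination (from)

from keyword: sets the offset of the first hit to return.
Combined with size it gives pagination; size is the number of documents returned starting at that offset, similar to select * from tablename limit 1, 2; in MySQL. The default paging depth in ES is 10000: a request whose from + size exceeds 10000 fails. The depth is controlled by the index.max_result_window setting, which can be changed. The deeper the page, the more processing and memory ES needs.
To work around deep pagination, use scroll or search_after. The drawback is that you cannot jump between pages (for example from page 1 straight to page 5); you can only move forward one page at a time.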

GET /ems/_search
{"query": {"match_all": {}},"sort": [{"age": {"order": "desc"}}],"size": 2, "from": 1
}
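The from/size arithmetic and the 10000 limit described above can be checked on the client before a request is sent. The following Python sketch assumes 1-based page numbers; page_params is a hypothetical helper, and MAX_RESULT_WINDOW mirrors the default of index.max_result_window.

```python
MAX_RESULT_WINDOW = 10000  # default value of index.max_result_window


def page_params(page, page_size, max_window=MAX_RESULT_WINDOW):
    """Translate a 1-based page number into from/size, rejecting deep pages."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    if start + page_size > max_window:
        raise ValueError("from + size exceeds the result window; use scroll or search_after")
    return {"from": start, "size": page_size}


print(page_params(3, 20))
```

With a page size of 20, page 500 is the last page that fits in the default window; page 501 raises an error instead of producing a request that ES would reject.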

4.4 Returning selected fields (_source)

GET /ems/_search
{"query": { "match_all": {} },"_source": ["name", "age"]
}

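4.5 Term query (term)

term keyword: searches for an exact term.

Experimenting with term queries shows that the default analyzer in ES is the standard analyzer (StandardAnalyzer), which splits English text into words and Chinese text into single characters.
It also shows that among the mapping types, keyword, date, integer, long, double, boolean and ip are not analyzed; only text is.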

GET /ems/_search
{"query": {"term": {"address": {"value": "北京"}}}
}
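To see why the analyzer matters for term queries, the Python sketch below imitates the behavior described above with a regular expression. It is only an approximation of the standard analyzer, not its real implementation: it lowercases ASCII words and emits each character in the basic CJK range as its own token. The function name approx_standard_tokens is hypothetical.

```python
import re

# Either one CJK ideograph (its own token) or a run of ASCII letters/digits.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer: words lowercased, CJK split per character."""
    return [token.lower() for token in _TOKEN.findall(text)]


tokens = approx_standard_tokens("北京 Redis")

# A term query compares the whole search term with individual indexed tokens.
matches_text_field = "北京" in tokens          # text field: indexed as separate characters
matches_keyword_field = "北京" in ["北京"]      # keyword field: indexed as one whole value

print(tokens, matches_text_field, matches_keyword_field)
```

Under these assumptions the term 北京 finds nothing in a text field indexed with the standard analyzer, while the same term matches a keyword field that stores the value whole.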

4.6 Range query (range)

range keyword: matches documents whose field value lies within the given range.

GET /ems/_search
{"query": {"range": {"age": {"gte": 8,"lte": 30}}}
}

4.7 Prefix query (prefix)

prefix keyword: matches documents containing terms that start with the given prefix.

GET /ems/_search
{"query": {"prefix": {"content": {"value": "redis"}}}
}

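4.8 Wildcard query (wildcard)

wildcard keyword: ? matches exactly one arbitrary character and * matches zero or more arbitrary characters. Note that a fully fuzzy wildcard pattern can cause performance problems.
Avoid wildcard queries that start with a wildcard. If there is really no alternative, limit the length of the string the user may enter. A better option is to do the work at index time: choose a suitable tokenizer, such as the nGram tokenizer, to preprocess the data, and then use the much cheaper term query to get equivalent fuzzy search. For search-as-you-type scenarios, prefer the completion suggester or the phrase/term suggesters, which perform better but are slightly less fuzzy, and fall back to the fuzzier but slower wildcard, regexp, or fuzzy queries only when the suggester returns nothing.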

GET /ems/_search
{"query": {"wildcard": {"content": {"value": "re*"}}}
}
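The idea of replacing a leading-wildcard search with n-grams built at index time can be illustrated in a few lines of Python. This is a simplified sketch of what an n-gram tokenizer produces, not the ES implementation; ngrams is a hypothetical helper and the gram sizes 2 and 3 are chosen only for the example.

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of text whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


indexed_terms = set(ngrams("redis", 2, 3))

# A search for "*edi*" becomes a plain term lookup against the indexed grams.
found = "edi" in indexed_terms

print(sorted(indexed_terms), found)
```

The cost moves from query time to index time and storage: every indexed value produces many grams, so the gram sizes should be kept as narrow as the use case allows.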

4.9 Query by multiple ids (ids)

ids keyword: takes an array of ids and returns the corresponding documents.

GET  /ems/_search
{"query": {"ids": {"values": ["lOiUsHUBe6kjTlxcqX3c","lQ5HwWkBxH7z6xax7W3_"]}}
}

4.10 Fuzzy query (fuzzy)

fuzzy keyword: matches documents containing terms similar to the given term. Note: the number of edits allowed must be between 0 and 2.

GET /ems/_search
{"query": {"fuzzy": {"content":"spoong"}}
}
# Note: the maximum edit distance is 0, 1 or 2, depending on the term length
Term length 0..2: must match exactly
Term length 3..5: one edit allowed
Term length > 5: two edits allowed
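The length rule above can be sketched in Python together with an edit-distance check. This is an illustration under assumptions, not how ES evaluates fuzzy queries internally: it uses plain Levenshtein distance, whereas ES by default also counts a transposition of two adjacent characters as a single edit. The function names are hypothetical.

```python
def auto_max_edits(term):
    """Edits allowed for a term under the length rule: 0..2 -> 0, 3..5 -> 1, longer -> 2."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # delete from a
                               current[j - 1] + 1,                    # insert into a
                               previous[j - 1] + (char_a != char_b)))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(query_term, indexed_term):
    """True if indexed_term is within the allowed number of edits of query_term."""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)


print(fuzzy_match("spoong", "spoon"), fuzzy_match("ab", "ac"))
```

Here spoong has six characters, so two edits are allowed and spoon (one deletion away) matches, while the two-character term ab must match exactly.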

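4.11 Boolean query (bool)

bool keyword: combines multiple conditions into a compound boolean query.
must: like &&, every clause must match
should: like ||, matching one clause is enough
must_not: like !, none of the clauses may match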

GET /ems/_search
{"query": {"bool": {"must": [{"range": {"age": {"gte": 0,"lte": 30}}}],"must_not": [{"wildcard": {"content": {"value": "redi?"}}}]}},"sort": [{"age": {"order": "desc"}}]
}
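The way the three clause types combine can be modeled with ordinary predicates. The Python sketch below is a simplified model that ignores scoring and the filter clause; bool_matches is a hypothetical name. It also reflects one detail of the default behavior: should clauses are only required when there is no must clause, and otherwise they are optional.

```python
from fnmatch import fnmatchcase


def bool_matches(doc, must=(), should=(), must_not=()):
    """Decide whether doc passes a bool query made of predicate functions."""
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    if should and not must:
        # Without must clauses, at least one should clause has to match.
        return any(clause(doc) for clause in should)
    return True


def age_in_range(doc):
    return 0 <= doc["age"] <= 30


def content_like_redi(doc):
    # Same pattern as the wildcard "redi?" in the request above.
    return fnmatchcase(doc["content"], "redi?")


redis_doc = {"age": 20, "content": "redis"}
spring_doc = {"age": 20, "content": "spring"}

print(bool_matches(redis_doc, must=[age_in_range], must_not=[content_like_redi]))
print(bool_matches(spring_doc, must=[age_in_range], must_not=[content_like_redi]))
```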

4.12 Highlighting (highlight)

highlight keyword: highlights the matching keywords in the documents that satisfy the query.

GET /ems/_search
{"query": {"term": {"content": {"value": "redis"}}},"highlight": {"fields": {"*": {}}}
}

Custom highlight HTML tags: highlighted text is wrapped in the <em> tag by default; the pre_tags and post_tags properties inside highlight replace the default em tag with custom HTML tags.

GET /ems/_search
{"query":{"term":{"content":"spring"}},"highlight": {"pre_tags": ["<span style='color:red'>"],"post_tags": ["</span>"],"fields": {"*":{}}}
}

Multi-field highlighting: set require_field_match to false to highlight matches in multiple fields; the default is true.

GET /ems/_search
{"query":{"term":{"content":"spring"}},"highlight": {"pre_tags": ["<span style='color:red'>"],"post_tags": ["</span>"],"require_field_match":false,"fields": {"*":{}}}
}

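4.13 Multi-field query (multi_match)

Note: with this query, the search text is first analyzed with the relevant analyzer and the resulting terms are then searched, which gives better results.

# fields lists the fields to search
GET /ems/_search
{"query": {"multi_match": {"query": "中国","fields": ["name","content"]}}
}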

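4.14 Multi-field analyzed query (query_string)

Note: with this query, the search text is first analyzed with the given analyzer and the resulting terms are then searched, which gives better results.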

GET /dangdang/book/_search
{"query": {"query_string": {"query": "中国声音","analyzer": "ik_max_word", "fields": ["name","content"]}}
}

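4.15 Exact phrase query (match_phrase)

Matches an exact phrase. If the queried field has an analyzer, the input is analyzed with it, and a document is returned only if both conditions hold:
All terms of the match_phrase input appear in the queried field.
Those terms appear in the queried field in the same order as in the match_phrase input.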

GET /ems/_search
{"query": {"match_phrase": {"name": "Hello World"}}
}
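The two conditions can be expressed as a search for a consecutive run of tokens. The Python sketch below assumes the default slop of 0, so the terms must be adjacent as well as ordered, and it uses lowercase whitespace splitting as a stand-in for the real analyzer. The names simple_tokens and phrase_matches are hypothetical.

```python
def simple_tokens(text):
    """Stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_matches(field_text, phrase):
    """True if the phrase tokens occur consecutively and in order in the field."""
    field_tokens = simple_tokens(field_text)
    phrase_tokens = simple_tokens(phrase)
    width = len(phrase_tokens)
    return any(field_tokens[i:i + width] == phrase_tokens
               for i in range(len(field_tokens) - width + 1))


print(phrase_matches("say Hello World now", "Hello World"))
print(phrase_matches("World Hello", "Hello World"))
```

A field containing "Hello big World" would not match either, because with slop 0 no other token may sit between the phrase terms.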

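5 Filtering

Strictly speaking, ES has two kinds of search operations: query and filter. A query, as used so far, computes a score for every returned document by default and sorts by that score. A filter only selects the matching documents without scoring them, and its results can be cached. In terms of performance alone, filters are therefore faster than queries. Put differently, filters suit narrowing down a large data set, while queries suit precise matching. In practice, filter the data first and then run the query on what remains.

5.1 Filter syntax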

GET /ems/_search
{"query": {"bool": {"must": [{"match_all": {}}],"filter": {"range": {"age": {"gte": 10}}}}}
}

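NOTE: when a request contains both filter and query, the filter is executed first and then the query.

NOTE: Elasticsearch automatically caches frequently used filters to improve performance.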

5.2 term, terms

These have the same meaning as in queries: term is for exact matching and terms matches any of several terms. Using them as filters makes little difference.

# Filter with term
GET /ems/_search
{"query": {"bool": {"must": [{"term": {"name": {"value": "小黑"}}}],"filter": {"term": {"content":"spring"}}}}
}
# Filter with terms
GET /ems/_search
{"query": {"bool": {"must": [{"term": {"name": {"value": "梅超风"}}}],"filter": {"terms": {"content":["redis","开源"]}}}}
}

5.3 range filter

GET /ems/_search
{"query": {"bool": {"must": [{"term": {"name": {"value": "中国"}}}],"filter": {"range": {"age": {"gte": 7,"lte": 20}}}}}
}

5.4 exists filter

Keeps only documents in which the given field exists, that is, has a non-null value.

GET /ems/_search
{"query": {"bool": {"must": [{"term": {"name": {"value": "中国"}}}],"filter": {"exists": {"field":"aaa"}}}}
}

5.5 ids filter

Keeps only documents whose _id is one of the given values.

GET /ems/_search
{"query": {"bool": {"must": [{"term": {"name": {"value": "中国"}}}],"filter": {"ids": {"values": ["1","2","3"]}}}}
}

6 Sorting

6.1 Sorting on basic fields


# Index structure
PUT /my-index-000001
{"mappings": {"properties": {"post_date": { "type": "date" },"user": {"type": "keyword"},"name": {"type": "keyword"},"age": { "type": "integer" }}}
}
# Sort by post_date ascending, then user, then name descending, then age descending, then _score
GET /my-index-000001/_search
{"sort" : [{ "post_date" : {"order" : "asc"}},"user",{ "name" : "desc" },{ "age" : "desc" },"_score"],"query" : {"term" : { "user" : "kimchy" }}
}

6.2 Sorting on nested fields


# Sort ascending by price inside the nested field offer
POST /_search
{"query" : {"term" : { "product" : "chocolate" }},"sort" : [{"offer.price" : {"mode" :  "avg","order" : "asc","nested": {"path": "offer","filter": {"term" : { "offer.color" : "blue" }}}}}]
}

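7 Aggregations

7.1 Bucket Aggregations

A bucket works like a container: ES goes through the documents and puts every document that meets a given criterion into the same bucket. Bucketing corresponds to group by in SQL. The keywords include Terms Aggregation, Filter Aggregation, Histogram Aggregation and Date Aggregation.

# Create the index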
PUT /cars
{"mappings": { "properties": {"price": {"type": "long"},"color": {"type": "keyword"},"brand": {"type": "keyword"},"sellTime": {"type": "date"}} }
}
# Add data
POST /cars/_bulk
{ "index": {}}
{ "price" : 80000, "color" : "red", "brand" : "BMW", "sellTime" : "2014-01-28" }
{ "index": {}}
{ "price" : 85000, "color" : "green", "brand" : "BMW", "sellTime" : "2014-02-05" }
{ "index": {}}
{ "price" : 120000, "color" : "green", "brand" : "Mercedes", "sellTime" : "2014-03-18" }
{ "index": {}}
{ "price" : 105000, "color" : "blue", "brand" : "Mercedes", "sellTime" : "2014-04-02" }
{ "index": {}}
{ "price" : 72000, "color" : "green", "brand" : "Audi", "sellTime" : "2014-05-19" }
{ "index": {}}
{ "price" : 60000, "color" : "red", "brand" : "Audi", "sellTime" : "2014-06-05" }
{ "index": {}}
{ "price" : 40000, "color" : "red", "brand" : "Audi", "sellTime" : "2014-07-01" }
{ "index": {}}
{ "price" : 35000, "color" : "blue", "brand" : "Honda", "sellTime" : "2014-08-12" }

7.2 Terms Aggregation

Terms Aggregation keyword: groups documents by each unique value of a field.

GET /cars/_search
{"aggs": {"car_brand": {"terms": {"field": "brand"}}}
}
# Show only the 3 buckets with the most documents
GET /cars/_search
{"aggs": {"car_brand": {"terms": {"field": "brand","size": 3}}}
}
# Sort the buckets
GET /cars/_search
{"aggs": {"car_brand": {"terms": {"field": "brand","order": {"_count": "asc"}}}}
}
# Show only buckets with at least 3 documents
GET /cars/_search
{"aggs": {"brands_max_num": {"terms": {"field": "brand","min_doc_count": 3}}}
}
# Bucket only the exactly specified terms
GET /cars/_search
{"aggs": {"brand_cars": {"terms": {"field": "brand","include": ["BMW", "Audi"]}}}
}
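What these terms aggregations compute can be reproduced locally on the sample data. The Python sketch below copies the brand values from the bulk request in 7.1 and mimics the bucketing with collections.Counter; it illustrates the semantics only and says nothing about how ES executes the aggregation.

```python
from collections import Counter

# brand values of the eight documents indexed in 7.1
BRANDS = ["BMW", "BMW", "Mercedes", "Mercedes", "Audi", "Audi", "Audi", "Honda"]

# terms aggregation: one bucket per unique value with its document count
brand_counts = Counter(BRANDS)

# "size": 3 keeps the three largest buckets
top3 = brand_counts.most_common(3)

# "min_doc_count": 3 keeps buckets holding at least 3 documents
at_least_3 = {brand: count for brand, count in brand_counts.items() if count >= 3}

# "include": ["BMW", "Audi"] restricts bucketing to the listed terms
included = {brand: brand_counts[brand] for brand in ("BMW", "Audi")}

print(dict(brand_counts), top3, at_least_3, included)
```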

7.3 Filter Aggregation

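Filter Aggregation keyword: names a specific field and a specific value, so that on top of a Terms Aggregation only documents with that value are aggregated.

# Keep only the bucket for brand BMW and compute its average price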
GET /cars/_search
{"aggs": {"car_brands": {"filter": {"term": {"brand": "BMW"}},"aggs": {"avg_price": {"avg": {"field": "price"}}}}}
}

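Filters Aggregation keyword: a Filter Aggregation takes a single filter condition and returns a single bucket. To aggregate over several specific values, use the Filters Aggregation.

# One bucket for color red and one bucket for brand Audi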
GET /cars/_search
{"aggs": {"cars": {"filters": {"filters": {"colorBucket":{"match":{"color":"red"}},"brandBucket":{"match":{"brand":"Audi"}}}}}}
}

7.4 Histogram Aggregation

Histogram Aggregation keyword: like Terms, Histogram groups data; the difference is that Terms groups by the field's values, while Histogram groups the field by a given interval.

# Bucket prices at intervals of 10000
GET /cars/_search
{"aggs": {"prices": {"histogram": {"field": "price","interval": 10000}}}
}
# Bucket prices at intervals of 10000 and omit buckets that contain no documents
GET /cars/_search
{"aggs": {"prices": {"histogram": {"field": "price","interval": 10000,"min_doc_count": 1}}}
}
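The bucket key of a histogram is the value rounded down to a multiple of the interval. The Python sketch below applies that rule to the prices from 7.1; histogram is a hypothetical helper, and it assumes an offset of 0. Like the request with min_doc_count set to 1, it produces no entries for empty intervals.

```python
from collections import Counter

# price values of the eight documents indexed in 7.1
PRICES = [80000, 85000, 120000, 105000, 72000, 60000, 40000, 35000]


def histogram(values, interval):
    """Count values per bucket, where the bucket key is the value rounded down to the interval."""
    return Counter((value // interval) * interval for value in values)


buckets = histogram(PRICES, 10000)

for key in sorted(buckets):
    print(key, buckets[key])
```

With the default min_doc_count of 0, ES would additionally return the empty buckets between the smallest and largest key, here 50000, 90000 and 110000.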

7.5 Range Aggregation

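Range Aggregation keyword: uses the ranges supplied by the user as buckets and aggregates accordingly. One request can carry several ranges, each of which becomes a bucket.

# Bucket by price range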
GET /cars/_search
{"aggs": {"prices_range": {"range": {"field": "price","ranges": [{"to":50000},{"from": 50000,"to": 80000},{"from": 80000}]}}}
}
# A key name can also be given for each range
GET /cars/_search
{"aggs": {"prices_range": {"range": {"field": "price","ranges": [{"key": "<50000", "to":50000},{"key": "50000~80000", "from": 50000,"to": 80000},{"key": ">80000", "from": 80000}]}}}
}
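One detail that is easy to miss is that each range includes its from value and excludes its to value. The Python sketch below replays the keyed request on the prices from 7.1 to make the boundaries visible; range_buckets is a hypothetical helper and the data is copied from the bulk request.

```python
# price values of the eight documents indexed in 7.1
PRICES = [80000, 85000, 120000, 105000, 72000, 60000, 40000, 35000]

# (key, from, to) with None for an open end, as in the keyed request above
RANGES = [("<50000", None, 50000), ("50000~80000", 50000, 80000), (">80000", 80000, None)]


def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    counts = {key: 0 for key, _, _ in ranges}
    for value in values:
        for key, low, high in ranges:
            if (low is None or value >= low) and (high is None or value < high):
                counts[key] += 1
    return counts


counts = range_buckets(PRICES, RANGES)
print(counts)
```

The car priced at exactly 80000 lands in the bucket keyed ">80000", so a label such as ">=80000" would describe that bucket more accurately.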

7.6 Date Aggregation

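Date Aggregation keyword: covers Date Histogram Aggregation and Date Range Aggregation.

1. Date Histogram

Date Histogram keyword: a histogram aggregation for date values; it behaves like the Histogram Aggregation.

# Bucket by month to show the number of sales in each month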
GET /cars/_search
{"aggs": {"sales_over_time": {"date_histogram": {"field": "sellTime","interval": "month","format": "yyyy-MM-dd"}}}
}
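For a monthly interval, every date is mapped to the first day of its month and that day becomes the bucket key. The Python sketch below does the same for the sellTime values from 7.1, formatting keys as yyyy-MM-dd to match the format in the request. It is an illustration of the grouping only and ignores time zones.

```python
from collections import Counter
from datetime import date

# sellTime values of the eight documents indexed in 7.1
SELL_DATES = ["2014-01-28", "2014-02-05", "2014-03-18", "2014-04-02",
              "2014-05-19", "2014-06-05", "2014-07-01", "2014-08-12"]


def month_key(day_text):
    """Map an ISO date string to the first day of its month."""
    day = date.fromisoformat(day_text)
    return day.replace(day=1).isoformat()


monthly_sales = Counter(month_key(day_text) for day_text in SELL_DATES)

for key in sorted(monthly_sales):
    print(key, monthly_sales[key])
```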

2. Date Range

Date Range keyword: a range aggregation for date values; it behaves like the Range Aggregation.

GET /cars/_search
{"aggs": {"range": {"date_range": {"field": "sellTime","format": "yyyy", "ranges": [{"from": "2014","to": "2019"}]}}}
}

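8 Search templates

Java users who have worked with Velocity templates will find this familiar: given a template and its parameters, the actual request is generated. Start with a basic example: inline works much like an inline script, with the template written directly in the request.

The example below replaces field and value with the actual values before searching.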
GET /blog_website/_search/template
{"inline":{"query": {"match": {"{{field}}": "{{value}}"}}},"params": {"field": "content","value": "博客"}
}

8.1 toJson

Restriction: the inline content must be written on a single line.

GET /blog_website/_search/template
{"inline": "{\"query\": {\"match\": {{#toJson}}matchCondition{{/toJson}}}}","params": {"matchCondition":{"content":"博客"}}
}

8.2 join

Purpose: joins the elements of an array into one string using the given delimiter.
In the example below, the titles array becomes "博客 网站"; delimiter sets the separator.

GET /blog_website/blogs/_search/template
{"inline": {"query": {"match": {"title": "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"}}},"params": {"titles": ["博客", "网站"]}
}

After rendering, the template above becomes the following request:

GET /blog_website/blogs/_search
{"query": {"match" : {"title" : "博客 网站"}}
}

8.3 default value

Add a views field:

POST /blog_website/blogs/1/_update
{"doc": {"views": 5}
}

GET /blog_website/blogs/_search/template
{"inline": {"query": {"range": {"views": {"gte": "{{start}}","lte": "{{end}}{{^end}}20{{/end}}"}}}},"params": {"start": 1,"end": 10}
}

The request above passes two parameters and uses {{^end}}20{{/end}} to give end a default value of 20: when params.end is not supplied, the default 20 is used.
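How the default value comes about can be traced with a very small template renderer. The Python sketch below is not the Mustache engine that ES uses; it is a minimal approximation that supports only {{name}} variables, {{#name}} sections and {{^name}} inverted sections, and it does not handle nested sections that reuse the same name. The function name render is hypothetical.

```python
import re

# {{#name}}...{{/name}} (shown if name is set) or {{^name}}...{{/name}} (shown if it is not)
_SECTION = re.compile(r"\{\{([#^])(\w+)\}\}(.*?)\{\{/\2\}\}", re.S)
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render(template, params):
    """Render a tiny subset of Mustache: variables, sections and inverted sections."""
    def expand_section(match):
        kind, name, body = match.groups()
        present = bool(params.get(name))
        keep = present if kind == "#" else not present
        return render(body, params) if keep else ""

    without_sections = _SECTION.sub(expand_section, template)
    # Missing variables render as an empty string.
    return _VARIABLE.sub(lambda m: str(params.get(m.group(1), "")), without_sections)


template = '"gte": "{{start}}", "lte": "{{end}}{{^end}}20{{/end}}"'

print(render(template, {"start": 1, "end": 10}))
print(render(template, {"start": 1}))
```

When end is supplied, {{end}} prints it and the inverted section is dropped; when end is missing, {{end}} prints nothing and the inverted section contributes the literal 20.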

8.4 conditional

Insert a document:

POST /my_index/my_type/10
{"line":"我的博客","line_no": 5
}

Query:

GET /my_index/_search/template
{"file": "conditional","params": {"text": "博客","line_no": true,"start": 1,"end": 10}
}

The file parameter means a template file must be prepared in advance; the file name ends with the .mustache suffix.
config\scripts\conditional.mustache

{"query": {"bool": {"must": {"match": {"line": "{{text}}"}},"filter": {{{#line_no}}"range": {"line_no": {{{#start}}"gte": "{{start}}"{{#end}},{{/end}}{{/start}}{{#end}}"lte": "{{end}}"{{/end}}}}{{/line_no}}}}}
}

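Read this against the parameters in params: a tag that starts with "#", such as #line_no, is a conditional section, and the enclosed part of the template is enabled only if that parameter is present.
After adding the file, remember to restart ES.

When to use

The main benefit is reuse. In a large team, different people often need to run similar searches. Whoever maintains the underlying platform can wrap those searches as search templates and place them in the scripts directory of each ES node; other teams then no longer have to hand-write the same complex queries and simply call a template with a few parameters.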