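Modern search engines generally offer a suggest-as-you-type feature that autocompletes or corrects the query while the user is typing. Helping users enter more precise keywords improves how well documents match in the search that follows. On Google, for instance, the input is autocompleted at first; once it reaches a certain length and can no longer be completed, for example because a word is misspelled, similar words or sentences are suggested instead.
Official documentation for version 6.8: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters.html
In Elasticsearch (ES), this kind of feature is implemented through the Suggesters API.
A suggester is a special kind of search. "text" holds the text supplied by the caller, usually whatever the user typed in the UI. If the user types the misspelled term "lucen", it is looked up in the specified field, "body"; when the term cannot be found in the index (suggest_mode "missing"), suggested terms are returned.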
suggest_mode accepts the following values:
missing: Only provide suggestions for suggest text terms that are not in the index. This is the default.
popular: Only suggest suggestions that occur in more docs than the original suggest text term.
always: Suggest any matching suggestions based on terms in the suggest text.
string_distance accepts the following values:
internal: The default, based on damerau_levenshtein but highly optimized for comparing string distance for terms inside the index.
damerau_levenshtein: String distance algorithm based on the Damerau-Levenshtein algorithm.
levenshtein: String distance algorithm based on the Levenshtein edit distance algorithm.
jaro_winkler: String distance algorithm based on the Jaro-Winkler algorithm.
ngram: String distance algorithm based on character n-grams.
PUT /suggest_article/
{
"mappings": {
"_doc": {
"properties": {
"body": {
"type": "text"
}
}
}
}
}
PUT suggest_article/_doc/1
{
"body":"lucene is very cool"
}
Index documents 2 through 6 the same way, using these bodies:
"body":"Elasticsearch builds on top of lucene"
"body":"Elasticsearch rocks"
"body":"elastic is the company behind ELK stack"
"body":"Elk stack rocks"
"body":"elasticsearch is rock solid"
Search API
POST suggest_article/_search
{
"from": 0,
"size": 10,
"query": {
"match": {
"body": "lucen rock"
}
},
"suggest": {
"term-suggestion": {
"text": "lucen rock",
"term": {
"suggest_mode": "missing", // popular always
"field": "body"
}
}
}
}
Note: for Chinese queries, set "analyzer": "simple" on the suggester so that the search text is not split further by query-time analysis.
Result:
{
"took" : 38,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.1149852,
"hits" : [
{
"_index" : "suggest_article",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.1149852,
"_source" : {
"body" : "elasticsearch is rock solid"
}
}
]
},
"suggest" : {
"term-suggestion" : [
{
"text" : "lucen",
"offset" : 0,
"length" : 5,
"options" : [
{
"text" : "lucene",
"score" : 0.8,
"freq" : 2
}
]
},
{
"text" : "rock",
"offset" : 6,
"length" : 4,
"options" : [
{
"text" : "rocks",
"score" : 0.75,
"freq" : 2
}
]
}
]
}
}
Java API:
Suggestion request (keyword is the text typed into the search box):
SearchSourceBuilder builder = new SearchSourceBuilder();
TermSuggestionBuilder termSuggestionBuilder = SuggestBuilders.termSuggestion("body").text(keyword);
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("term-suggestion", termSuggestionBuilder);
builder.suggest(suggestBuilder);
Suggestion response handling:
Suggest suggest = searchResponse.getSuggest();
// The name must match the one passed to addSuggestion in the request
TermSuggestion termSuggestion = suggest.getSuggestion("term-suggestion");
for (TermSuggestion.Entry entry : termSuggestion.getEntries()) {
for (TermSuggestion.Entry.Option option : entry) {
String suggestText = option.getText().string(); // suggested term
float score = option.getScore();
}
}
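Note: suggestions can be requested for several fields, each with its own suggester.
Summary: the term suggester first runs the input text through an analyzer (so the outcome depends on which analyzer is used) and breaks it into individual terms. It then produces suggestions for each term on its own, without considering how the terms relate to each other. The suggestions for each term (possibly none) are wrapped in an options list, and the suggester returns them all together. The term suggester works at the level of terms, not documents, and is mainly used for spelling correction.
The phrase suggester adds logic on top of the term suggester to select entire corrected phrases, instead of individual tokens, weighted by n-gram language models. It takes the relationship between terms into account, such as whether they occur together in the indexed text, how close they are to each other, and their frequencies. In practice this lets the suggester make better decisions about which tokens to pick, based on co-occurrence and frequency.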
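The per-term behaviour summarized above can be illustrated with a small, self-contained Java sketch. It is not how Elasticsearch implements the term suggester: the dictionary is a hand-made list of terms from the six sample documents, the class and method names are made up for illustration, and only the "missing" mode with a plain edit-distance cut-off is modelled. It is not meant to reproduce the exact response shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermSuggestSketch {

    // Hand-made dictionary: some lowercased terms from the six sample documents (an assumption).
    static final List<String> TERMS = Arrays.asList(
            "lucene", "elasticsearch", "rocks", "rock", "elastic", "elk", "stack", "solid");

    // Damerau-Levenshtein distance (optimal string alignment variant):
    // insertions, deletions, substitutions and adjacent transpositions each cost 1.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                boolean swapped = i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1);
                if (swapped) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // "missing" mode: each token is handled on its own, and only tokens
    // that are absent from the dictionary receive options.
    static Map<String, List<String>> suggest(String text, int maxEdits) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            List<String> options = new ArrayList<>();
            if (!TERMS.contains(token)) {
                for (String candidate : TERMS) {
                    if (editDistance(token, candidate) <= maxEdits) {
                        options.add(candidate);
                    }
                }
            }
            result.put(token, options);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {lucen=[lucene], rock=[]}
        System.out.println(suggest("lucen rock", 1));
    }
}
```

Each token gets its own options list, and the relationship between "lucen" and "rock" plays no role, which is the limitation the phrase suggester addresses.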
The phrase suggester accepts, among others, the following parameters:
gram_size: Sets the max size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field. If the field uses a shingle filter, the gram_size is set to the max_shingle_size if not explicitly set.
real_word_error_likelihood: The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.
confidence: The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold are included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0, the top N candidates are returned. The default is 1.0.
max_errors: The maximum percentage of the terms considered to be misspellings in order to form a correction. It accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >=1 as an absolute number of query terms. The default is set to 1.0, meaning only corrections with at most one misspelled term are returned. Note that setting this too high can negatively impact performance. Low values like 1 or 2 are recommended; otherwise the time spent in suggest calls might exceed the time spent in query execution.
collate: Checks each suggestion against the specified query to prune suggestions for which no matching docs exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion has been generated. The query must be specified and it can be templated; see search templates for more information. The current suggestion is automatically made available as the {{suggestion}} variable, which should be used in your query. You can still specify your own template params; the suggestion value will be added to the variables you specify. Additionally, you can specify a prune to control if all phrase suggestions will be returned; when set to true the suggestions will have an additional option collate_match, which will be true if matching documents for the phrase were found, false otherwise. The default value for prune is false.
Search API:
POST suggest_article/_search
{
"suggest": {
"phrase-suggestion": {
"text": "lucne and elasticsear rock",
"phrase": {
"field": "body",
"max_errors":2, # maximum number of terms that may be misspelled
"confidence":0,
"direct_generator":[{
"field":"body",
"suggest_mode":"always"
}],
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}
{
"took" : 99,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"phrase-suggestion" : [
{
"text" : "lucne and elasticsear rock",
"offset" : 0,
"length" : 26,
"options" : [
{
"text" : "lucne and elasticsearch rocks",
"highlighted" : "lucne and <em>elasticsearch rocks</em>",
"score" : 0.12709484
},
{
"text" : "lucne and elasticsearch rock",
"highlighted" : "lucne and <em>elasticsearch</em> rock",
"score" : 0.10422645
},
{
"text" : "lucne and elasticsear rocks",
"highlighted" : "lucne and elasticsear <em>rocks</em>",
"score" : 0.10036137
},
{
"text" : "lucne and elasticsear rock",
"highlighted" : "lucne and elasticsear rock",
"score" : 0.082303174
},
{
"text" : "lucene and elasticsear rock",
"highlighted" : "<em>lucene</em> and elasticsear rock",
"score" : 0.030959692
}
]
}
]
}
}
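To make the role of confidence concrete, here is a minimal Java sketch that applies the documented rule (keep candidates scoring higher than confidence times the input phrase's score) to the five scores in the response above. It assumes that the score of the unchanged option, 0.082303174, is the input phrase's score; the class and method names are made up, and the sketch says nothing about how Elasticsearch generates or ranks candidates internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfidenceSketch {

    // The five options and scores copied from the sample response.
    static Map<String, Double> sampleCandidates() {
        Map<String, Double> candidates = new LinkedHashMap<>();
        candidates.put("lucne and elasticsearch rocks", 0.12709484);
        candidates.put("lucne and elasticsearch rock", 0.10422645);
        candidates.put("lucne and elasticsear rocks", 0.10036137);
        candidates.put("lucne and elasticsear rock", 0.082303174); // the input phrase itself
        candidates.put("lucene and elasticsear rock", 0.030959692);
        return candidates;
    }

    // Keep only candidates whose score is strictly above confidence * inputScore.
    static List<String> filter(Map<String, Double> candidates, double inputScore, double confidence) {
        double threshold = confidence * inputScore;
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : candidates.entrySet()) {
            if (e.getValue() > threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double inputScore = 0.082303174; // assumed score of the unchanged input phrase
        // confidence 0: every candidate passes, as in the request above
        System.out.println(filter(sampleCandidates(), inputScore, 0.0).size()); // prints 5
        // confidence 1.0 (the default): only candidates that beat the input phrase pass
        System.out.println(filter(sampleCandidates(), inputScore, 1.0).size()); // prints 3
    }
}
```

Under this reading, the request sets "confidence": 0 precisely so that weaker candidates, including the unchanged input, show up in the options.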
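Custom highlighting: "pre_tag":"<b id='d1' class='t1' style='color:red;font-size:18px;'>", "post_tag":"</b>"
Note: highlighting of suggester results differs slightly from highlighting of query results.
For example, the custom tag options here are pre_tag and post_tag, not the pre_tags and post_tags used for query highlighting.
Java API:
Suggestion request (keyword is the text typed into the search box):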
SearchSourceBuilder builder = new SearchSourceBuilder();
PhraseSuggestionBuilder phraseSuggestBuilder = SuggestBuilders.phraseSuggestion("body")
.text(keyword)
.highlight("<em>", "</em>")
.maxErrors(2) // maximum number of terms that may be misspelled
.analyzer("simple")
.confidence(0)
.size(10);
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("phrase-suggestion", phraseSuggestBuilder);
builder.suggest(suggestBuilder);
Note: to get suggestions from several fields, create one PhraseSuggestionBuilder per field.
Suggestion response handling:
public static Map<String, List> suggestPhraseResponse(SearchResponse result) {
Map<String, List> resultResponse = Maps.newHashMap();
List suggestions = Lists.newArrayList();
List<Map<String,Object>> hits = Lists.newArrayList();
if (null != result) {
Iterator<SearchHit> iterator = result.getHits().iterator();
while (iterator.hasNext()) {
Map<String, Object> hit = new HashMap<>();
SearchHit searchHit = iterator.next();
hit.put("matches", searchHit.getSourceAsMap());
hit.put("score", searchHit.getScore());
hits.add(hit);
}
Suggest suggest = result.getSuggest();
// The name must match the one passed to addSuggestion in the request
PhraseSuggestion phraseSuggestion = suggest.getSuggestion("phrase-suggestion");
for (PhraseSuggestion.Entry entry : phraseSuggestion){
for (PhraseSuggestion.Entry.Option option : entry){
Map<String, Object> optionMap = Maps.newHashMap();
String text = option.getText().string();
float score = option.getScore();
String highlighted = option.getHighlighted().string();
optionMap.put("text", text);
optionMap.put("score", score);
optionMap.put("highlighted", highlighted);
suggestions.add(optionMap);
}
}
}
resultResponse.put("hits", hits);
resultResponse.put("suggestions", suggestions);
return resultResponse;
}
The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (grams that appear at least once in the index). The smoothing model can be selected by setting the smoothing parameter to one of the supported options. Each smoothing model supports specific properties that can be configured.
POST _search
{
"suggest": {
"text" : "obel prize",
"simple_phrase" : {
"phrase" : {
"field" : "title.trigram",
"size" : 1,
"smoothing" : {
"laplace" : {
"alpha" : 0.7
}
}
}
}
}
}
The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms.
Currently only one type of candidate generator is supported, the direct_generator. The phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.
The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.
POST _search
{
"suggest": {
"text" : "obel prize",
"simple_phrase" : {
"phrase" : {
"field" : "title.trigram",
"size" : 1,
"direct_generator" : [ {
"field" : "title.trigram",
"suggest_mode" : "always"
}, {
"field" : "title.reverse",
"suggest_mode" : "always",
"pre_filter" : "reverse",
"post_filter" : "reverse"
} ]
}
}
}
}
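Summary: the phrase suggester does not handle Chinese well. For Chinese queries, set "analyzer":"simple" so that the search text is not split further by query-time analysis.
The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature that guides users to relevant results as they type, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.
Ideally, auto-complete should be as fast as the user types, giving instant feedback relevant to what has already been typed. The completion suggester is therefore optimized for speed: it uses data structures that enable fast lookups but are costly to build and are stored in memory.
Its main use case is auto completion. Here a query has to be sent to the backend for every character the user types, so fast typing puts strict demands on backend response time. For this reason it uses a different data structure from the previous two suggesters: instead of relying on the inverted index, the analyzed data is encoded into an FST (finite state transducer) stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. However, an FST can only be used for prefix lookups, which is also the limitation of the Completion Suggester.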
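The prefix-only behaviour described above can be sketched in plain Java. The sketch uses a sorted set as a stand-in for the FST; that is an assumption made purely for illustration, since a real FST is a far more compact structure, and the class and method names are made up. What it shares with the completion suggester is the lookup pattern: all inputs starting with the typed prefix are found quickly, and nothing else is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixLookupSketch {

    // Sorted set of lowercased inputs; a simplified stand-in for the in-memory FST.
    static final TreeSet<String> INPUTS = new TreeSet<>();
    static {
        INPUTS.add("lucene is very cool");
        INPUTS.add("elasticsearch builds on top of lucene");
        INPUTS.add("elasticsearch rocks");
        INPUTS.add("elastic is the company behind elk stack");
        INPUTS.add("elk stack rocks");
        INPUTS.add("elasticsearch is rock solid");
    }

    // Return up to `size` inputs that start with the given prefix.
    // In sorted order, all strings sharing a prefix are contiguous,
    // so the scan can stop at the first non-matching entry.
    static List<String> complete(String prefix, int size) {
        String p = prefix.toLowerCase();
        List<String> out = new ArrayList<>();
        for (String input : INPUTS.tailSet(p, true)) {
            if (!input.startsWith(p) || out.size() == size) {
                break;
            }
            out.add(input);
        }
        return out;
    }

    public static void main(String[] args) {
        // Four inputs start with "elastic".
        System.out.println(complete("elastic", 10));
        // "rock" occurs only in the middle of inputs, so a prefix lookup finds nothing: prints []
        System.out.println(complete("rock", 10));
    }
}
```

The second call shows the limitation: a term that appears only in the middle of an input cannot be reached by a prefix lookup.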
You index suggestions like any other field. A suggestion is made of an input and an optional weight attribute. An input is the expected text to be matched by a suggestion query and the weight determines how the suggestions will be scored. Indexing a suggestion is as follows:
PUT completion_article/_doc/1?refresh
{
"suggest" : {
"input": [ "Nevermind", "Nirvana" ],
"weight" : 34
}
}
You can index multiple suggestions for a document as follows:
PUT completion_article/_doc/1?refresh
{
"suggest" : [
{
"input": "Nevermind",
"weight" : 10
},
{
"input": "Nirvana",
"weight" : 3
}
]
}
You can also use the following shorthand form. Note that you cannot specify a weight with the shorthand form.
PUT completion_article/_doc/1?refresh
{
"suggest" : [ "Nevermind", "Nirvana" ]
}
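Suggesting works as usual, except that you have to specify the suggest type as completion. Suggestions are near real-time, which means new suggestions can be made visible by refresh and documents once deleted are never shown. Note that the standalone _suggest endpoint used in the request below comes from older versions of the documentation; it was deprecated in 5.0 and removed in 6.0, so on 6.8 the same body goes under a "suggest" key in a _search request. This request: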
POST music/_suggest?pretty
{
"song-suggest" : {
"prefix" : "nir",
"completion" : {
"field" : "suggest"
}
}
}
Response:
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"song-suggest" : [ {
"text" : "nir",
"offset" : 0,
"length" : 3,
"options" : [ {
"text" : "Nirvana",
"_index": "music",
"_type": "song",
"_id": "1",
"_score": 1.0,
"_source": {
"suggest": ["Nevermind", "Nirvana"]
}
} ]
} ]
}
The _source meta-field must be enabled, which is the default behavior, for _source to be returned with suggestions.
The configured weight for a suggestion is returned as _score. The text field uses the input of your indexed suggestion. Suggestions return the full document _source by default. The size of the _source can impact performance due to disk fetch and network transport overhead. To save some network overhead, filter out unnecessary fields from the _source using source filtering to minimize _source size. Note that the _suggest endpoint doesn't support source filtering but using suggest on the _search endpoint does:
POST music/_search?size=0
{
"_source": "suggest",
"suggest": {
"song-suggest" : {
"prefix" : "nir",
"completion" : {
"field" : "suggest"
}
}
}
}
PUT /completion_article/
{
"mappings": {
"_doc": {
"properties": {
"body": {
"type": "completion"
}
}
}
}
}
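Note: to use this feature, give the field a special mapping (type completion) that indexes its values for fast completion.
1. An index-time analyzer can be set on the body field; it affects how the FST is encoded and therefore which lookups match.
2. A search-time analyzer only takes effect if it is set in the mapping, for example: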
"type": "completion",
"analyzer": "trigram_analyzer",
"search_analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
PUT completion_article/_doc/1
{
"body":"lucene is very cool"
}
Index documents 2 through 6 the same way, using these bodies:
"body":"Elasticsearch builds on top of lucene"
"body":"Elasticsearch rocks"
"body":"elastic is the company behind ELK stack"
"body":"Elk stack rocks"
"body":"elasticsearch is rock solid"
Search API:
POST completion_article/_search
{ "size": 0,
"_source": {
"includes": [
"body"
],
"excludes": []
},
"suggest": {
"completion-suggest": {
"prefix": "elastic i",
"completion": {
"field": "body",
"skip_duplicates": true // deduplicate suggestions
}
}
}
}
Response:
{
"took" : 42,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"completion-suggest" : [
{
"text" : "elastic i",
"offset" : 0,
"length" : 9,
"options" : [
{
"text" : "elastic is the company behind ELK stack",
"_index" : "completion_article",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"body" : "elastic is the company behind ELK stack"
}
}
]
}
]
}
}
Java API:
Suggestion request:
SearchSourceBuilder builder = new SearchSourceBuilder();
CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion("body")
.prefix(keyword).skipDuplicates(true) // deduplicate suggestions
.size(10);
String[] source = {"body"};
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("completion-suggest", completionSuggestionBuilder);
builder.suggest(suggestBuilder).fetchSource(source, null);
Suggestion response handling:
if(RestStatus.OK.equals(searchResponse.status())) {
// Read the suggestions; the name must match the one passed to addSuggestion
Suggest suggest = searchResponse.getSuggest();
CompletionSuggestion completionSuggestion = suggest.getSuggestion("completion-suggest");
for (CompletionSuggestion.Entry entry : completionSuggestion.getEntries()) {
for (CompletionSuggestion.Entry.Option option : entry) {
String suggestText = option.getText().string();
}
}
}
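Note: to deduplicate suggestions, use .skipDuplicates(true).
When set to true, this option can slow down search because more suggestions need to be visited to find the top N.
It is worth noting that the Completion Suggester also runs the original data through an analysis phase at index time. Depending on the analyzer, some words may be transformed and others removed, which affects how the FST is encoded and therefore which lookups match.
For example, let us reindex with a mapping that changes the analyzer to "english":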
PUT /completion_article_analyzer/
{
"mappings": {
"_doc": {
"properties": {
"body": {
"type": "completion",
"analyzer": "english"
}
}
}
}
}
PUT completion_article_analyzer/_doc/6
{
"body":"elasticsearch is rock solid"
}
Search API:
POST completion_article_analyzer/_search
{ "size": 0,
"suggest": {
"completion_article": {
"prefix": "elastic i",
"completion": {
"field": "body"
}
}
}
}
Result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"completion_article" : [
{
"text" : "elastic i",
"offset" : 0,
"length" : 9,
"options" : [ ]
}
]
}
}
The options list is empty: the english analyzer strips stop words, and "is" is one of them. Because it was removed at index time, nothing matches the trailing "i".
Analysis:
POST _analyze
{
"analyzer":"english",
"text": "elasticsearch is rock solid"
}
{
"tokens" : [
{
"token" : "elasticsearch",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "rock",
"start_offset" : 17,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "solid",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
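The FST encodes only these three tokens and, by default, also records their positions in the document and the separators between them. When the user searches for "elastic i", the input is split into "elastic" and "i"; the FST has no encoding for "i", so the match fails.
Searching for "elastic is" returns a result again: this time the english analyzer also strips "is" from the query text, so only the prefix "elastic" has to be looked up in the FST, which naturally matches.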
POST completion_article_analyzer/_search
{ "size": 0,
"suggest": {
"completion_article": {
"prefix": "elastic is",
"completion": {
"field": "body"
}
}
}
}
Result:
{
"took" : 17,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"completion_article" : [
{
"text" : "elastic is",
"offset" : 0,
"length" : 10,
"options" : [
{
"text" : "elastic is the company behind ELK stack",
"_index" : "completion_article_analyzer",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"body" : "elastic is the company behind ELK stack"
}
}
]
}
]
}
}
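The interaction between stop-word removal and prefix matching can be reproduced with a deliberately simplified Java model. The assumptions are significant: the stop-word set is a small hand-picked subset, stemming and position increments are ignored, the separator character is arbitrary, and the class and method names are made up. It only shows why "elastic i" fails while "elastic is" succeeds once both sides go through the same analysis.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AnalyzedPrefixSketch {

    // A few English stop words; a small stand-in for the analyzer's real list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "the", "of", "on"));

    // Arbitrary separator placed between the tokens that survive analysis.
    static final char SEPARATOR = '|';

    // Lowercase, split on whitespace, drop stop words, join the rest with the separator.
    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(SEPARATOR);
            }
            sb.append(token);
        }
        return sb.toString();
    }

    // A query matches when its analyzed form is a prefix of the analyzed indexed input.
    static boolean matches(String indexedInput, String queryPrefix) {
        return analyze(indexedInput).startsWith(analyze(queryPrefix));
    }

    public static void main(String[] args) {
        String indexed = "elasticsearch is rock solid"; // analyzed: elasticsearch|rock|solid
        // "elastic i" analyzes to "elastic|i", which is not a prefix of the indexed form
        System.out.println(matches(indexed, "elastic i"));  // prints false
        // "elastic is" analyzes to "elastic", because "is" is dropped at query time too
        System.out.println(matches(indexed, "elastic is")); // prints true
    }
}
```

In this model the partial word "i" is the problem: it is not a stop word, so it survives analysis and has to match a token that was never indexed.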
The Completion Suggester also supports fuzzy queries, which means you can make a typo in your search and still get results.
POST music/_suggest?pretty
{
"song-suggest" : {
"prefix" : "nor",
"completion" : {
"field" : "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
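Fuzzy queries accept fuzzy-specific parameters: fuzziness, transpositions, min_length, prefix_length and unicode_aware.
The Completion Suggester also supports regex queries, which means you can express the prefix as a regular expression.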
POST music/_suggest?pretty
{
"song-suggest" : {
"regex" : "n[ever|i]r",
"completion" : {
"field" : "suggest"
}
}
}
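Regex queries accept regex-specific parameters: flags and max_determinized_states.
Summary: the completion suggester is aimed at auto-completion and does not correct misspelled terms.
Extending the Completion Suggester
We can attach category information to documents to make suggestions more precise.
For example, for the input "维生素" (vitamin), with an index defined as follows: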
{
"indexName":"drug",
"indexSource":{
"settings":{
"number_of_shards":1,
"number_of_replicas":2,
"index":{
"analysis":{
"filter":{
"bigram_filter":{
"max_shingle_size":"2",
"min_shingle_size":"2",
"output_unigrams":"false",
"type":"shingle"
},
"trigram_filter":{
"max_shingle_size":"3",
"min_shingle_size":"2",
"type":"shingle"
},
"my_synonym":{
"type":"synonym",
"synonyms_path":"analysis/synonym.txt"
}
},
"analyzer":{
"trigram_analyzer":{
"filter":[
"lowercase",
"trigram_filter"
],
"type":"custom",
"tokenizer":"standard"
},
"index_ansj_analyzer":{
"filter":[
"my_synonym",
"asciifolding"
],
"type":"custom",
"tokenizer":"index_ansj"
},
"comma":{
"pattern":",",
"type":"pattern"
},
"lowercase_ngram_1_2":{
"filter":"lowercase",
"tokenizer":"ngram_1_2_tokenizer"
},
"bigram_analyzer":{
"filter":[
"lowercase",
"bigram_filter"
],
"type":"custom",
"tokenizer":"standard"
},
"pinyin_analyzer":{
"tokenizer":"my_pinyin"
}
},
"tokenizer":{
"my_pinyin":{
"lowercase":"true",
"keep_original":"false",
"keep_first_letter":"true",
"keep_separate_first_letter":"true",
"type":"pinyin",
"limit_first_letter_length":"16",
"keep_full_pinyin":"true",
"keep_none_chinese_in_joined_full_pinyin":"true",
"keep_joined_full_pinyin":"true"
},
"ngram_1_2_tokenizer":{
"token_chars":[
"letter",
"digit"
],
"min_gram":"1",
"type":"nGram",
"max_gram":"2"
}
}
}
}
},
"mappings":{
"properties":{
"categoryfirst":{
"type":"keyword"
},
"categorysecond":{
"type":"keyword"
},
"commonname":{
"type":"completion",
"analyzer":"trigram_analyzer",
"preserve_separators":true,
"preserve_position_increments":true,
"max_input_length":50,
"contexts":[
{
"type":"category",
"name":"spu_category"
}
],
"fields":{
"ansj":{
"type":"text",
"analyzer":"index_ansj_analyzer"
},
"text":{
"type":"text"
},
"pinyincompletion":{
"type":"completion",
"analyzer":"pinyin_analyzer",
"preserve_separators":true,
"preserve_position_increments":true,
"search_analyzer":"simple",
"max_input_length":50
},
"keyword":{
"type":"keyword"
},
"pinyin":{
"type":"text",
"boost":10,
"term_vector":"with_offsets",
"analyzer":"pinyin_analyzer"
},
"shingle":{
"type":"text",
"analyzer":"trigram_analyzer"
}
}
},
"ctime":{
"type":"date",
"format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
},
"doctorteamhotid":{
"type":"keyword"
},
"drugid":{
"type":"keyword"
},
"drugname":{
"type":"completion",
"analyzer":"trigram_analyzer",
"preserve_separators":true,
"preserve_position_increments":true,
"max_input_length":50,
"contexts":[
{
"type":"category",
"name":"spu_category"
}
],
"fields":{
"ansj":{
"type":"text",
"analyzer":"index_ansj_analyzer"
},
"text":{
"type":"text"
},
"pinyincompletion":{
"type":"completion",
"analyzer":"pinyin_analyzer",
"search_analyzer":"simple",
"preserve_separators":true,
"preserve_position_increments":true,
"max_input_length":50
},
"keyword":{
"type":"keyword"
},
"pinyin":{
"type":"text",
"boost":10,
"term_vector":"with_offsets",
"analyzer":"pinyin_analyzer"
},
"shingle":{
"type":"text",
"analyzer":"trigram_analyzer"
}
}
},
"drugtype":{
"type":"keyword"
},
"factoryname":{
"type":"text",
"fields":{
"ansj":{
"type":"text",
"analyzer":"index_ansj_analyzer"
},
"keyword":{
"type":"keyword"
},
"pinyin":{
"type":"text",
"boost":10,
"term_vector":"with_offsets",
"analyzer":"pinyin_analyzer"
},
"shingle":{
"type":"text",
"analyzer":"trigram_analyzer"
}
},
"copy_to":[
"text_all"
]
},
"id":{
"type":"keyword"
},
"included":{
"type":"keyword"
},
"indextype":{
"type":"keyword"
},
"iscfda":{
"type":"keyword"
},
"medicineaccuratenum":{
"type":"keyword",
"copy_to":[
"text_all"
]
},
"prescription":{
"type":"keyword"
},
"relation":{
"type":"join",
"eager_global_ordinals":true,
"relations":{
"drug-spu":[
"drug-doctorteamhot",
"drug-sku"
]
}
},
"text_all":{
"type":"text"
},
"utime":{
"type":"date",
"format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
}
PUT drug/_doc/310346585771
{
"indextype": "drug-spu",
"categoryfirst": "drug",
"iscfda": "1",
"utime": "2020-12-29 09:37:04",
"drugname": {
"input": "", // a single input can be given as a plain string
"contexts": {
"spu_category": "drug"
}
},
"relation": "drug-spu",
"medicineaccuratenum": "国药准字Z20150067",
"commonname": {
"input": [
"维生素E乳膏" // multiple inputs are given as an array
],
"contexts": {
"spu_category": "drug"
}
},
"prescription": "0",
"ctime": "2020-12-23 22:00:18",
"id": "310346585771",
"categorysecond": "4171957ed83b25fa5727b1dd034eed50",
"included": "1",
"drugtype": "3",
"factoryname": "中国医学科学院皮肤病医院"
}
PUT drug/_doc/310346508974
{
"indextype": "drug-spu",
"categoryfirst": "drug",
"iscfda": "1",
"utime": "2020-12-29 09:37:04",
"drugname": {
"input": [
""
],
"contexts": {
"spu_category": "supplement"
}
},
"relation": "drug-spu",
"medicineaccuratenum": "国药准字Z20150067",
"commonname": {
"input": [
"维生素D滴剂"
],
"contexts": {
"spu_category": "supplement"
}
},
"prescription": "0",
"ctime": "2020-12-23 21:45:39",
"id": "310346508974",
"categorysecond": "a1332770e38c9146e4376fff033fe715",
"included": "0",
"drugtype": "0",
"factoryname": ""
}
Search API
POST drug/_search
{
"_source": {
"includes": [
"commonname",
"drugname"
],
"excludes": []
},
"suggest": {
"commonname-completionsuggest": {
"prefix": "STC踝控",
"completion": {
"field": "commonname",
"size": 10,
"skip_duplicates": true,
"contexts": {
"spu_category": [
{
"context": "others",
"boost": 1,
"prefix": false
}
]
}
}
},
"drugname-completionsuggest": {
"prefix": "STC踝控",
"completion": {
"field": "drugname",
"size": 10,
"skip_duplicates": true,
"contexts": {
"spu_category": [
{
"context": "others",
"boost": 1,
"prefix": false
}
]
}
}
}
}
}
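Result (the response shown here was captured for the prefix "维生素" under the suggestion name MY_SUGGESTION, not for the exact request above):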
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"MY_SUGGESTION" : [
{
"text" : "维生素",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "维生素E乳膏",
"_index" : "drug-20.12.30-103610",
"_type" : "_doc",
"_id" : "310346585771",
"_score" : 1.0,
"_ignored" : [
"drugname.pinyincompletion",
"drugname"
],
"_source" : {
"indextype" : "drug-spu",
"categoryfirst" : "drug",
"iscfda" : "1",
"utime" : "2020-12-29 09:37:04",
"drugname" : {
"input" : [
""
],
"contexts" : {
"spu_category" : "drug"
}
},
"relation" : "drug-spu",
"medicineaccuratenum" : "国药准字Z20150067",
"commonname" : {
"input" : [
"维生素E乳膏"
],
"contexts" : {
"spu_category" : "drug"
}
},
"prescription" : "0",
"ctime" : "2020-12-23 22:00:18",
"id" : "310346585771",
"categorysecond" : "4171957ed83b25fa5727b1dd034eed50",
"included" : "1",
"drugtype" : "3",
"factoryname" : "中国医学科学院皮肤病医院"
},
"contexts" : {
"spu_category" : [
"drug"
]
}
}
]
}
]
}
}
Java API
String keyword = searchRequest.getKeyword();
String category = searchRequest.getCategory();
SearchSourceBuilder builder = new SearchSourceBuilder();
if (StringUtils.isNotBlank(keyword)) {
CompletionSuggestionBuilder commonnameBuilder = SuggestBuilders.completionSuggestion("commonname")
.prefix(keyword).skipDuplicates(true)
.size(10);
CompletionSuggestionBuilder drugnameBuilder = SuggestBuilders.completionSuggestion("drugname")
.prefix(keyword).skipDuplicates(true)
.size(10);
SuggestionBuilder pinyinDrugnameBuilder = SuggestBuilders.completionSuggestion("drugname.pinyincompletion")
.prefix(keyword).skipDuplicates(true)
.size(10);
SuggestionBuilder pinyinCommonnameBuilder = SuggestBuilders.completionSuggestion("commonname.pinyincompletion")
.prefix(keyword).skipDuplicates(true)
.size(10);
if (StringUtils.isNotBlank(category)) {
CategoryQueryContext context = CategoryQueryContext.builder()
.setBoost(1)
.setCategory(category)
.setPrefix(false).build();
Map categoryMap = Maps.newHashMap();
List categoryList = Lists.newArrayList();
categoryList.add(context);
categoryMap.put("spu_category", categoryList);
commonnameBuilder.contexts(categoryMap);
drugnameBuilder.contexts(categoryMap);
}
String[] source = {"commonname", "drugname"};
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("commonname-completionsuggest", commonnameBuilder)
.addSuggestion("drugname-completionsuggest", drugnameBuilder)
.addSuggestion("pinyincommonname-completionsuggest", pinyinCommonnameBuilder)
.addSuggestion("pinyindrugname-completionsuggest", pinyinDrugnameBuilder);
builder.suggest(suggestBuilder).fetchSource(source, null);
}
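What the category context adds on top of plain prefix completion can be sketched in a few lines of Java. This is a toy in-memory model, not the Elasticsearch implementation: the two entries are hypothetical English stand-ins for the sample documents, the class and method names are made up, and a null category is taken to mean "no context filter", mirroring the Java code above, which only sets contexts when a category is supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContextFilterSketch {

    // One completion input together with the category context it was indexed under.
    static final class Entry {
        final String input;
        final String category;

        Entry(String input, String category) {
            this.input = input;
            this.category = category;
        }
    }

    // Hypothetical English stand-ins for the two sample documents.
    static final List<Entry> ENTRIES = Arrays.asList(
            new Entry("vitamin e cream", "drug"),
            new Entry("vitamin d drops", "supplement"));

    // Prefix completion restricted to one category; null means no context filter.
    static List<String> complete(String prefix, String category) {
        String p = prefix.toLowerCase();
        List<String> out = new ArrayList<>();
        for (Entry entry : ENTRIES) {
            boolean categoryMatches = category == null || category.equals(entry.category);
            if (categoryMatches && entry.input.startsWith(p)) {
                out.add(entry.input);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(complete("vitamin", "drug")); // prints [vitamin e cream]
        System.out.println(complete("vitamin", null));   // prints [vitamin e cream, vitamin d drops]
    }
}
```

The same prefix yields different suggestions depending on the category, which is how the context narrows results for a user who is already browsing one part of the catalogue.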
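A geo context lets you associate one or more geo points or geohashes with a suggestion at index time. At query time, suggestions can be filtered and boosted if they lie within a certain distance of a specified geo location.
Internally, geo points are encoded as geohashes with the specified precision.
In terms of precision: Completion > Phrase > Term, while recall runs in the opposite order. In terms of performance, the Completion Suggester is the fastest; if it meets the business requirements, using only the Completion Suggester for prefix matching is ideal. The Phrase and Term suggesters search the inverted index, so their performance should be considerably lower by comparison. Keep the amount of index data the suggesters touch as small as possible; ideally, after some warm-up, the whole index can be mapped into memory.
Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.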