1. The Default Analyzer
Analyzers were already introduced in an earlier post: ElasticSearch7.3 学习之倒排索引揭秘及初识分词器(Analyzer). Here we only cover the default analyzer, the standard analyzer.
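For reference, the standard analyzer is essentially the standard tokenizer plus a lowercase token filter (its stop-word filter is disabled by default), so it can be rebuilt as a custom analyzer. A minimal sketch (the index name standard_example and analyzer name rebuilt_standard are just for illustration):

PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}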
2. Modifying the Analyzer Settings
First, define a custom analyzer named es_std that enables the English stop-words token filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
The response is the standard index-creation acknowledgement.
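It should look like the following (the same shape as the response to the index creation shown in section 3):

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}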
Next, test the two analyzers, starting with the default standard analyzer:
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
The result:
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
As you can see, the text is simply split into individual words. Next, test the custom es_std analyzer defined above:
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
The response:
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
Only two tokens remain: the English stop words have all been removed.
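As a side note, you do not even need to create an index to experiment: the _analyze API also accepts an ad-hoc analysis chain defined inline. A quick sketch that should produce the same two tokens:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stop",
      "stopwords": "_english_"
    }
  ],
  "text": "a dog is in the house"
}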
3. Customizing Your Own Analyzer
First, delete the index created above:
DELETE my_index
Then run the request below. In short, its rules are: first strip HTML tags, convert & to and, then tokenize with the standard tokenizer, and finally lowercase the tokens and remove the stop words a and the. It is worth reading carefully; the analyzer is tested below.
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}
The response:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}
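A handy trick before the full test: _analyze can also exercise a single component in isolation. For example, to check only the &_to_and char filter (here paired with the keyword tokenizer so the whole text stays one token), something like:

GET /my_index/_analyze
{
  "char_filter": [ "&_to_and" ],
  "tokenizer": "keyword",
  "text": "tom&jerry"
}

This should return the single token tomandjerry, which matches the first token in the full test below.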
As usual, test the analyzer (note the <a> tag in the text, which the html_strip char filter will remove):
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}
The result:
{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
In the output, tom&jerry has become the single token tomandjerry (& was mapped to and), the <a> tag was stripped, HAHA!! was lowercased to haha, and the stop words a and the are gone. Finally, in actual use we can configure a specific field to use the custom analyzer. The syntax:
PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
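To close the loop, here is a minimal end-to-end sketch (the document ID and sample text are invented for illustration): index a document into the content field, then search for the raw tom&jerry form. Because a match query analyzes the query text with the field's analyzer, both sides become tomandjerry and the document should be found:

PUT /my_index/_doc/1
{
  "content": "tom&jerry are a friend in the house"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom&jerry"
    }
  }
}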