Learning Elasticsearch 7.3: Custom Analyzers

虚幻大学 xuhss

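1. The default analyzer

Analyzers were introduced in an earlier post in this series, "ElasticSearch7.3 学习之倒排索引揭秘及初识分词器(Analyzer)" (on the inverted index and a first look at analyzers). This post covers only the default one, the standard analyzer.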

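2. Changing the analyzer settings

First define a custom analyzer named es_std. It is based on the standard analyzer, with the English stop-word token filter enabled.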

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Elasticsearch acknowledges the request and creates the index.

Next, test the two analyzers, starting with the default one.

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

Response:

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

As you can see, the text is simply split into words. Next, test the custom es_std analyzer defined above.

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}

Response:

{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

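Only two tokens remain: the stop words a, is, in, and the have all been removed. The position values (1 and 5) are the same as in the standard analyzer's output, so the gaps left by the removed words are preserved.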
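To make the effect of the stop filter concrete, here is a rough Python sketch of what the two `_analyze` calls above do. It is not Elasticsearch code: the regex tokenizer is only a crude stand-in for the standard tokenizer, the function name `analyze` is made up for illustration, and the stop list is my understanding of Lucene's default English set, so treat its exact contents as an assumption.

```python
import re

# Assumed contents of the _english_ stop list (Lucene's default English set).
ENGLISH_STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def analyze(text, stopwords=frozenset()):
    """Hypothetical stand-in for _analyze: split, lowercase, drop stop words."""
    tokens = []
    # Positions are assigned before filtering, so removed words leave gaps.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()
        if token in stopwords:
            continue
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    text = "a dog is in the house"
    # No stop words: roughly what the standard analyzer returned.
    print([t["token"] for t in analyze(text)])
    # With the English stop list: roughly what es_std returned.
    print([(t["token"], t["position"]) for t in analyze(text, ENGLISH_STOPWORDS)])
```

With an empty stop list the sketch returns all six words; with the English list it keeps only dog (position 1) and house (position 5), mirroring the two responses above.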

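3. Building your own analyzer

First delete the index created above.

DELETE my_index

Then run the request below. In brief, the analyzer first strips HTML tags and replaces & with and, then tokenizes the text with the standard tokenizer, and finally lowercases the tokens and removes the stop words a and the. It is worth reading through carefully; the analyzer is tested further down.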

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

Response:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

As before, test the analyzer.

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

The result:

{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
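The order of the stages matters: character filters run on the raw text, then the tokenizer splits it, then the token filters run on each token. The following Python sketch walks through the same three stages. It is only an approximation under stated assumptions: the regexes are crude stand-ins for html_strip and the standard tokenizer, the function name `my_analyzer_sketch` is hypothetical, and the sample input includes an `<a>` tag (an assumption about the original request) so that the HTML-stripping step has something to remove.

```python
import re

STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def my_analyzer_sketch(text):
    """Hypothetical approximation of my_analyzer; returns (token, position) pairs."""
    # Stage 1: character filters, applied to the raw text in order.
    text = re.sub(r"<[^>]+>", "", text)   # crude stand-in for html_strip
    text = text.replace("&", "and")       # the &_to_and mapping

    # Stage 2: tokenizer (crude stand-in for the standard tokenizer).
    words = re.findall(r"\w+", text)

    # Stage 3: token filters, lowercase first and then the stop filter.
    tokens = []
    for position, word in enumerate(words):
        word = word.lower()
        if word not in STOPWORDS:
            tokens.append((word, position))
    return tokens


if __name__ == "__main__":
    sample = "tom&jerry are a friend in the house, <a>, HAHA!!"
    for token, position in my_analyzer_sketch(sample):
        print(position, token)
```

Because & is replaced before tokenization, tom&jerry becomes the single token tomandjerry; had the replacement been a token filter instead, the tokenizer would already have split the text at the &. The sketch does not reproduce offsets, which Elasticsearch reports relative to the original, unfiltered text.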

Finally, in real use you can configure a field to use the custom analyzer. The syntax is:

PUT /my_index/_mapping/
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
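Once a field is mapped this way, its text is analyzed when documents are indexed and, unless a separate search analyzer is configured, query text for a match query on that field goes through the same analyzer. The toy inverted index below sketches why that matters. Everything in it is illustrative: `analyze`, `build_index`, and `search` are hypothetical helpers, the analyzer is a crude approximation of my_analyzer, and the sample documents are made up.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a"}


def analyze(text):
    """Crude approximation of my_analyzer: & -> and, split, lowercase, drop stop words."""
    words = re.findall(r"\w+", text.replace("&", "and"))
    return [w.lower() for w in words if w.lower() not in STOPWORDS]


def build_index(docs):
    """Map each analyzed term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain any analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


if __name__ == "__main__":
    docs = {1: "Tom&Jerry are in the house", 2: "A dog is in the yard"}
    index = build_index(docs)
    print(search(index, "HOUSE"))        # case differs, still finds document 1
    print(search(index, "tomandjerry"))  # finds document 1 via the & mapping
    print(search(index, "the"))          # only a stop word: no terms, no hits
```

Because the same analysis runs on both sides, a query only has to match after normalization, not character for character.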

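If you found this post helpful, please click the "**Recommend**" button; your **"recommendation"** is my greatest motivation to keep writing! Reposting is welcome, but unless the author has agreed otherwise, a repost **must credit the author and link to the original post in a prominent place on the page**; otherwise the author reserves the right to pursue legal action.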
