ES Analyzers (Part 5): elasticsearch-analysis-jieba 8.7.0

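Today we'll walk through building the jieba analyzer for ES 8.7.0, and how to do the same for other 8.x versions. If you just want to use jieba on another ES 8.x release, change the elasticsearch.version value in the pom file, then package and install the plugin; little else is required.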

1. elasticsearch-jieba-plugin

The most recent release is 8.4.1. The last update was in 2022, and the open-source project has not been maintained since.
GitHub: https://github.com/sing1ee/elasticsearch-jieba-plugin/tree/8.4.1

2. elasticsearch-analysis-jieba

The most recent version is 6.8.17. It is in even worse shape than the plugin above: it has gone unmaintained for three years.
GitHub: https://github.com/huaban/elasticsearch-analysis-jieba/tree/dependabot/maven/org.elasticsearch-elasticsearch-6.8.17

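3. Deciding to repackage elasticsearch-jieba-plugin

The project I'm working on runs ES 8.7.0, and I could not find a matching jieba plugin release anywhere online.
The IK analyzer has more users than the jieba analyzer, so it keeps getting updated as new ES versions come out; the IK analyzer has currently reached 8.12.2. As of May 14, 2024, the latest ES version is 8.14.x.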

4. Building the elasticsearch-analysis-jieba analyzer

The original elasticsearch-analysis-jieba plugin has gone unused for a long time, but I think that name sounds better than elasticsearch-jieba-plugin (https://github.com/sing1ee/elasticsearch-jieba-plugin/tree/8.4.1). So I created a new local project named elasticsearch-analysis-jieba and copied the elasticsearch-jieba-plugin code into it. That project is managed with Gradle and I wanted to use Maven, so I made the changes described in the following subsections.

4.1 Add a pom.xml file

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <name>elasticsearch-analysis-jieba</name>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-analysis-jieba</artifactId>
  <version>${elasticsearch.version}</version>
  <packaging>jar</packaging>
  <description>jieba Analyzer for Elasticsearch</description>
  <inceptionYear>2011</inceptionYear>

  <properties>
    <elasticsearch.version>8.7.0</elasticsearch.version>
    <maven.compiler.target>17</maven.compiler.target>
    <elasticsearch.assembly.descriptor>${project.basedir}/src/main/assemblies/plugin.xml</elasticsearch.assembly.descriptor>
    <elasticsearch.plugin.name>analysis-jieba</elasticsearch.plugin.name>
    <elasticsearch.plugin.classname>org.elasticsearch.plugin.analysis.jieba.AnalysisJiebaPlugin</elasticsearch.plugin.classname>
    <elasticsearch.plugin.jvm>true</elasticsearch.plugin.jvm>
    <tests.rest.load_packaged>false</tests.rest.load_packaged>
    <skip.unit.tests>true</skip.unit.tests>
  </properties>

  <licenses>
    <license>
      <name>The Apache Software License, Version 2.0</name>
      <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <developers>
    <developer>
      <name>INFINI Labs</name>
      <email>hello@infini.ltd</email>
      <organization>INFINI Labs</organization>
      <organizationUrl>https://infinilabs.com</organizationUrl>
    </developer>
  </developers>

  <parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>9</version>
  </parent>

  <distributionManagement>
    <snapshotRepository>
      <id>oss.sonatype.org</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </snapshotRepository>
    <repository>
      <id>oss.sonatype.org</id>
      <url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
    </repository>
  </distributionManagement>

  <repositories>
    <repository>
      <id>oss.sonatype.org</id>
      <name>OSS Sonatype</name>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>true</enabled></snapshots>
      <url>https://oss.sonatype.org/content/repositories/releases/</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch</artifactId>
      <version>${elasticsearch.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>com.huaban</groupId>
      <artifactId>jieba-analysis</artifactId>
      <version>1.0.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-api</artifactId>
      <version>2.19.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>${maven.compiler.target}</source>
          <target>${maven.compiler.target}</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.11</version>
        <configuration>
          <includes>
            <include>**/*Tests.java</include>
          </includes>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <version>2.1.2</version>
        <executions>
          <execution>
            <id>attach-sources</id>
            <goals>
              <goal>jar</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <appendAssemblyId>false</appendAssemblyId>
          <outputDirectory>${project.build.directory}/releases/</outputDirectory>
          <descriptors>
            <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
          </descriptors>
          <archive>
            <manifest>
              <mainClass>fully.qualified.MainClass</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <profiles>
    <profile>
      <id>disable-java8-doclint</id>
      <activation>
        <jdk>[1.8,)</jdk>
      </activation>
      <properties>
        <additionalparam>-Xdoclint:none</additionalparam>
      </properties>
    </profile>
    <profile>
      <id>release</id>
      <build>
        <plugins>
          <plugin>
            <groupId>org.sonatype.plugins</groupId>
            <artifactId>nexus-staging-maven-plugin</artifactId>
            <version>1.6.3</version>
            <extensions>true</extensions>
            <configuration>
              <serverId>oss</serverId>
              <nexusUrl>https://oss.sonatype.org/</nexusUrl>
              <autoReleaseAfterClose>true</autoReleaseAfterClose>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-release-plugin</artifactId>
            <version>2.1</version>
            <configuration>
              <autoVersionSubmodules>true</autoVersionSubmodules>
              <useReleaseProfile>false</useReleaseProfile>
              <releaseProfiles>release</releaseProfiles>
              <goals>deploy</goals>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
              <source>${maven.compiler.target}</source>
              <target>${maven.compiler.target}</target>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-gpg-plugin</artifactId>
            <version>1.5</version>
            <executions>
              <execution>
                <id>sign-artifacts</id>
                <phase>verify</phase>
                <goals>
                  <goal>sign</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>
            <version>2.2.1</version>
            <executions>
              <execution>
                <id>attach-sources</id>
                <goals>
                  <goal>jar-no-fork</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-javadoc-plugin</artifactId>
            <version>2.9</version>
            <executions>
              <execution>
                <id>attach-javadocs</id>
                <goals>
                  <goal>jar</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>
  </profiles>
</project>
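
As mentioned at the start, targeting another 8.x release comes down to changing the elasticsearch.version property in this pom. The following is a minimal Python sketch of that edit, not part of the original project: the helper name set_es_version is hypothetical, and it assumes the pom contains exactly one <elasticsearch.version> element, as the one above does.

```python
import re

# Hypothetical helper (not from the plugin project): rewrite the
# <elasticsearch.version> property of a pom given as text.
_VERSION_RE = re.compile(r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)")


def set_es_version(pom_text: str, new_version: str) -> str:
    """Return pom_text with the elasticsearch.version property set to new_version."""
    new_text, count = _VERSION_RE.subn(
        lambda m: m.group(1) + new_version + m.group(2), pom_text, count=1
    )
    if count == 0:
        raise ValueError("pom has no <elasticsearch.version> property")
    return new_text


# Small inline demo; a real run would read and write pom.xml instead.
demo = "<properties><elasticsearch.version>8.7.0</elasticsearch.version></properties>"
print(set_es_version(demo, "8.9.2"))
```

Working on the text rather than a parsed tree keeps the rest of the file byte-for-byte unchanged. Because the project version is also ${elasticsearch.version}, the packaged artifact picks up the new version number as well.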

4.2 Modify the plugin-descriptor.properties file

# Elasticsearch plugin descriptor file
# This file must exist as 'plugin-descriptor.properties' at
# the root directory of all plugins.
#
# A plugin can be 'site', 'jvm', or both.
#
### example site plugin for "foo":
#
# foo.zip <-- zip file for the plugin, with this structure:
#   _site/ <-- the contents that will be served
#   plugin-descriptor.properties <-- example contents below:
#
# site=true
# description=My cool plugin
# version=1.0
#
### example jvm plugin for "foo"
#
# foo.zip <-- zip file for the plugin, with this structure:
#   <arbitrary name1>.jar <-- classes, resources, dependencies
#   <arbitrary nameN>.jar <-- any number of jars
#   plugin-descriptor.properties <-- example contents below:
#
# jvm=true
# classname=foo.bar.BazPlugin
# description=My cool plugin
# version=2.0.0-rc1
# elasticsearch.version=2.0
# java.version=1.7
#
### mandatory elements for all plugins:
#
# 'description': simple summary of the plugin
description=${project.description}
#
# 'version': plugin's version
version=${project.version}
#
# 'name': the plugin name
name=${elasticsearch.plugin.name}
#
# 'classname': the name of the class to load, fully-qualified.
classname=${elasticsearch.plugin.classname}
#
# 'java.version' version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=${maven.compiler.target}
#
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=${elasticsearch.version}
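
The ${...} placeholders in this descriptor are presumably filled in from the pom at build time (the assembly descriptor plugin.xml that would do this is not shown in the article, so treat that as an assumption). The Python sketch below only illustrates what the resolved file should look like for the 8.7.0 pom from section 4.1; the resolve function is a hypothetical stand-in for Maven's filtering, not Maven itself.

```python
import re

# Values copied from the pom.xml in section 4.1; project.version is
# ${elasticsearch.version}, which resolves to 8.7.0.
POM_VALUES = {
    "project.description": "jieba Analyzer for Elasticsearch",
    "project.version": "8.7.0",
    "elasticsearch.plugin.name": "analysis-jieba",
    "elasticsearch.plugin.classname": "org.elasticsearch.plugin.analysis.jieba.AnalysisJiebaPlugin",
    "maven.compiler.target": "17",
    "elasticsearch.version": "8.7.0",
}

# The non-comment lines of plugin-descriptor.properties.
TEMPLATE = """\
description=${project.description}
version=${project.version}
name=${elasticsearch.plugin.name}
classname=${elasticsearch.plugin.classname}
java.version=${maven.compiler.target}
elasticsearch.version=${elasticsearch.version}
"""


def resolve(template: str, values: dict) -> str:
    """Replace each ${key} with values[key]; unknown keys are left untouched."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


resolved = resolve(TEMPLATE, POM_VALUES)
print(resolved)
```

If the descriptor inside the built zip still contains a literal ${...}, the substitution did not run. The elasticsearch.version line matters most, since the descriptor comments above state that Elasticsearch refuses to start with a plugin built for a different version.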

4.3 Add a plugin-security.policy file

grant {
    // needed because of the hot reload functionality
    permission java.net.SocketPermission "*", "connect,resolve";
    permission java.lang.RuntimePermission "setContextClassLoader";
};

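4.4 Build the plugin

Package the project.

Find the zip produced by the build; the pom's assembly configuration writes it to target/releases/.

Extract it into the elasticsearch-8.7.0/plugins/analysis-jieba directory.

Now restart ES manually, and the elasticsearch-analysis-jieba analyzer is installed.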
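
The copy step can also be scripted. This is a rough Python sketch under two assumptions that the article does not confirm: that the zip holds plugin-descriptor.properties and the jars at its top level, and that the Elasticsearch home directory is passed in by the caller. The function name install_plugin is hypothetical.

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path: str, es_home: str, plugin_name: str = "analysis-jieba") -> Path:
    """Extract the plugin zip into <es_home>/plugins/<plugin_name> and return that directory."""
    target = Path(es_home) / "plugins" / plugin_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Assumption: the descriptor sits at the zip's top level, so it should
    # now be directly inside the plugin directory.
    if not (target / "plugin-descriptor.properties").is_file():
        raise FileNotFoundError(f"no plugin-descriptor.properties directly under {target}")
    return target
```

A call would look like install_plugin("target/releases/elasticsearch-analysis-jieba-8.7.0.zip", "/path/to/elasticsearch-8.7.0"), where both paths are placeholders to adapt. Elasticsearch still has to be restarted afterwards.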

5. Testing the jieba analyzer

Create an index in Kibana

PUT jieba_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Analyze a sample text

GET jieba_index/_analyze
{
  "analyzer": "my_ana",
  "text": "黄河之水天上来"
}

Response

{
  "tokens": [
    {"token": "黄河", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "黄河之水天上来", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0},
    {"token": "之水", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1},
    {"token": "天上", "start_offset": 4, "end_offset": 6, "type": "word", "position": 2},
    {"token": "上来", "start_offset": 5, "end_offset": 7, "type": "word", "position": 2}
  ]
}
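
To read this response: jieba_index emits the whole phrase as one token alongside shorter overlapping words, each with character offsets into the input. The short Python sketch below checks that every token matches the slice of the input its offsets point to. The response is copied from above, and the helper name check_offsets is hypothetical. Elasticsearch reports offsets in UTF-16 code units, which match Python string indices here because every character is in the Basic Multilingual Plane.

```python
import json

TEXT = "黄河之水天上来"

# Response copied from the _analyze call above.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "黄河", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
  {"token": "黄河之水天上来", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0},
  {"token": "之水", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1},
  {"token": "天上", "start_offset": 4, "end_offset": 6, "type": "word", "position": 2},
  {"token": "上来", "start_offset": 5, "end_offset": 7, "type": "word", "position": 2}
]}
""")


def check_offsets(text, tokens):
    """Return the tokens whose text differs from text[start_offset:end_offset]."""
    return [
        t for t in tokens
        if text[t["start_offset"]:t["end_offset"]] != t["token"]
    ]


mismatches = check_offsets(TEXT, RESPONSE["tokens"])
for t in RESPONSE["tokens"]:
    print(t["position"], t["start_offset"], t["end_offset"], t["token"])
print("offset mismatches:", len(mismatches))
```

"天上" (4 to 6) and "上来" (5 to 7) overlap on the character "上", so a query for either word can match the indexed text.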
