Using jieba word segmentation (Java version) and shipping it as a jar

Original

languageX

Published on 2023-06-02 15:50:37

Included in the column: 计算机视觉CV (Computer Vision CV)

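huaban/jieba-analysis is the most commonly used word segmentation tool for Java, and its GitHub page documents the usage in detail.

However, there are few tutorials on how to make jieba use your own dictionary and then package it, together with your own interface code, into a jar.

This article covers how to use jieba segmentation in Java, how to use a custom dictionary with it, and the problems you are likely to hit when shipping the result as a jar, along with their fixes.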

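1 Using jieba segmentation

There is no need to reinvent a Java version of jieba ourselves; we simply use the open-source jieba-analysis library.

Add the dependency to the pom file: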

<dependencies>
    <dependency>
        <groupId>com.huaban</groupId>
        <artifactId>jieba-analysis</artifactId>
        <version>1.0.2</version>
    </dependency>
</dependencies>

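Usage is straightforward:

import java.util.List;

import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.WordDictionary;

// info_str holds the text to be segmented
JiebaSegmenter segmenter = new JiebaSegmenter();
List<String> result = segmenter.sentenceProcess(info_str);

The code above is all it takes to segment the string info_str into words.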

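2 Using a custom dictionary

Sometimes there are special requirements: instead of the default dictionary, we want to use our own dictionary file, dict.txt.

This is where the pitfalls begin. Most of the advice found online looks like the following:

Put dict.txt in the resources folder, obtain its path with this.getClass().getResource("/dict.txt"), and pass that path straight to loadUserDict to load the dictionary.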

// Works only while dict.txt is a real file on disk (for example when run from the IDE)
URL file_path = this.getClass().getResource("/dict.txt");
Path path = Paths.get(new File(file_path.getPath()).getAbsolutePath());
WordDictionary.getInstance().loadUserDict(path);

The code above runs without any problem locally, but once the jar is built in step 3 it fails with an error such as

'the return value of "java.lang.ClassLoader.getResource(String)" is null'

or a file not found error,

or a user dict load failure error.
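These failures share a cause that can be checked directly. While the project runs from the IDE, the resource is an ordinary file, but after packaging it becomes an entry inside the jar: its URL uses the jar: protocol, and the string returned by getPath() is no longer a path on disk. The following is a minimal sketch of that difference; the class name ResourceLocation, the helpers isPlainFile and urlOf, and both example locations are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class ResourceLocation {

    /** Returns true only if the URL points at a regular file on disk. */
    static boolean isPlainFile(URL url) {
        return url != null && "file".equals(url.getProtocol());
    }

    /** Builds a URL from a string; wraps the checked exception for brevity. */
    static URL urlOf(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Shape of the URL while running from an IDE (hypothetical path).
        URL onDisk = urlOf("file:/work/target/classes/dict.txt");
        // Shape of the URL after packaging (hypothetical jar location).
        URL inJar = urlOf("jar:file:/app/demo.jar!/dict.txt");

        System.out.println(isPlainFile(onDisk)); // true
        System.out.println(isPlainFile(inJar));  // false
        // The "path" of a jar entry still contains the jar file and "!/",
        // so java.io.File cannot open it.
        System.out.println(inJar.getPath());
    }
}
```

A check like isPlainFile on the result of getResource can make the failure explicit during development instead of surfacing only after java -jar.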

3 Shipping a jar

Setting aside the jieba dictionary for a moment: when we want to load a text resource in Java, we often write something like this:

// Resolves the resource to a file-system path, which only exists outside a jar
String filePath = this.getClass().getClassLoader().getResource("dict.txt").getPath();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        txt_list.add(line);
    }
} catch (IOException e) {
    e.printStackTrace();
}

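After packaging, however, verifying the jar with java -jar fails because the dict.txt resource cannot be found.

Solution:

Read the resource as a stream instead, using InputStream is = this.getClass().getResourceAsStream("/dict.txt"):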

List<String> tag_list = new ArrayList<>();
// The stream is read from the classpath, so it also works inside a jar
try (InputStream is = this.getClass().getResourceAsStream("/dict.txt");
     BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    String line;
    while ((line = br.readLine()) != null) {
        tag_list.add(line);
    }
} catch (IOException e) {
    e.printStackTrace();
}

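Back to the problem of jieba loading a local dictionary: here too we cannot rely on getResource and would like to use the getResourceAsStream approach instead.

So what can we do? Write our own overload of the loading method? There is no need: looking inside the WordDictionary class shows that the library author has already solved this for us.

Here is the source of loadDict: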

public void loadDict() {
    this._dict = new DictSegment('\u0000');
    InputStream is = this.getClass().getResourceAsStream("/dict.txt");

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        long s = System.currentTimeMillis();
        // ... (excerpt ends here)

So all we need to do is call a different method:

WordDictionary.getInstance().loadDict();

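Note that because loadDict reads the fixed resource name /dict.txt, the custom dictionary must be named dict.txt, and every line must follow the format "word frequency part-of-speech".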
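The format rule above can be checked before packaging. The sketch below tests whether a line has the three fields the article describes, with a numeric frequency in the second position. It assumes fields are separated by spaces or tabs; the class name DictLineCheck and its methods are hypothetical and not part of jieba-analysis, and the sample words are placeholders.

```java
import java.util.Arrays;
import java.util.List;

final class DictLineCheck {

    /**
     * Returns true if the line looks like "word frequency pos":
     * three whitespace-separated fields with a numeric second field.
     * Assumption: fields are separated by spaces or tabs.
     */
    static boolean looksValid(String line) {
        String[] tokens = line.trim().split("[\t ]+");
        if (tokens.length != 3 || tokens[0].isEmpty()) {
            return false;
        }
        try {
            Double.parseDouble(tokens[1]);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Counts the lines that would need fixing before loading. */
    static long countInvalid(List<String> lines) {
        return lines.stream().filter(l -> !looksValid(l)).count();
    }

    public static void main(String[] args) {
        // Placeholder entries: one valid, one word-only, one with a bad frequency.
        List<String> lines = Arrays.asList("deeplearning 100 n", "vision", "segment abc n");
        System.out.println(countInvalid(lines)); // prints 2
    }
}
```

Running such a check over the lines of dict.txt is cheaper than discovering the problem after the jar has been built.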

A dictionary that contains only words, without the frequency and part-of-speech columns, will also fail to load properly.
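If the fixed file name is too restrictive, a generic alternative (not the route this article takes) is to copy the classpath resource to a temporary file, which gives a real path that Path-based APIs can open even when the program runs from a jar. The sketch below uses only the JDK; the class name ResourceToTempFile, the method copyToTemp, and the "user-dict-" prefix are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ResourceToTempFile {

    /**
     * Copies a stream (for example one returned by getResourceAsStream)
     * to a temporary file and returns the path of that file.
     */
    static Path copyToTemp(InputStream in, String suffix) {
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing.
            throw new IllegalArgumentException("resource not found on the classpath");
        }
        try (InputStream src = in) {
            Path tmp = Files.createTempFile("user-dict-", suffix);
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned path could then be handed to WordDictionary.getInstance().loadUserDict(path), the call shown in section 2. How a dictionary loaded that way interacts with the default one is not covered here, so treat this as a starting point rather than a drop-in replacement for the dict.txt approach.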

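4 Other issues

4.1 Java version configuration

After the project was moved from macOS to Windows, the build reported:

java: warning: source release 9 requires target release 9

The main cause is that the Java configuration differs between the two machines. Changing 9 to 8 in the compiler configuration in pom.xml (source and target are parameters of maven-compiler-plugin) fixes it: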

<configuration>
    <source>8</source>
    <target>8</target>
</configuration>

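4.2 Encoding problems caused by the platform move

This problem also appeared after the move: logic that behaved correctly on macOS produced wrong results on Windows.

Printing each line as it was read showed nothing but garbled characters, while text handled by the jieba library itself displayed correctly.

This is clearly an encoding problem, a topic covered in an earlier post.

Solution:

Follow the library source and specify the charset explicitly: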

 BufferedReader br = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
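To see why the explicit charset matters, the sketch below decodes the same UTF-8 bytes twice: once as UTF-8 and once with a different charset. ISO-8859-1 is used only as a stand-in for a platform default that is not UTF-8; the actual default on the Windows machine is not stated in this article. The class name CharsetDemo and the method firstLine are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {

    /** Reads the first line of the stream, decoding bytes with the given charset. */
    static String firstLine(InputStream is, Charset cs) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is, cs))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The word "jieba" in Chinese, written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String word = "\u7ed3\u5df4";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        String good = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // Stand-in for a non-UTF-8 platform default.
        String bad = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(word.equals(good)); // true
        System.out.println(word.equals(bad));  // false
    }
}
```

An InputStreamReader created without a charset falls back to the platform default, which is why the same jar can behave differently on two machines.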

4.3 "no main manifest attribute" in xxx-1.0.jar

If the packaged jar reports that it has no main manifest attribute,

add the following to pom.xml:

            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <includeSystemScope>true</includeSystemScope>
                </configuration>
                <version>2.0.1.RELEASE</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
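The error itself means that META-INF/MANIFEST.MF inside the jar has no Main-Class entry, which java -jar needs in order to find the entry point. The sketch below uses the JDK's java.util.jar.Manifest to show the difference between a manifest with and without that entry. The class name ManifestCheck, the method mainClassOf, and com.example.App are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class ManifestCheck {

    /** Returns the Main-Class entry of the given manifest text, or null if absent. */
    static String mainClassOf(String manifestText) {
        try {
            Manifest mf = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A manifest without an entry point: java -jar cannot start such a jar.
        String plain = "Manifest-Version: 1.0\r\n\r\n";
        // A manifest of a runnable jar; com.example.App is a hypothetical class.
        String runnable = "Manifest-Version: 1.0\r\nMain-Class: com.example.App\r\n\r\n";

        System.out.println(mainClassOf(plain));    // null
        System.out.println(mainClassOf(runnable)); // com.example.App
    }
}
```

After rebuilding with the plugin configuration above, inspecting the manifest of the produced jar in the same way is a quick check that a Main-Class entry is now present.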