langchain4j + Tika: A Quick Hands-On

code4it
Published 2025-03-07 09:18:41
Included in the column: 码匠的流水账

This article looks at how to parse documents with langchain4j and Apache Tika.

Steps

pom.xml

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-document-parser-apache-tika</artifactId>
            <version>1.0.0-beta1</version>
        </dependency>

example

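import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TikaTest {

    // SLF4J logger (assumed logging API)
    private static final Logger log = LoggerFactory.getLogger(TikaTest.class);

    public static void main(String[] args) {
        // Path of the file to parse; point it at a file that exists locally
        String path = System.getProperty("user.home") + "/downloads/tmp.xlsx";
        DocumentParser parser = new ApacheTikaDocumentParser();
        Document document = FileSystemDocumentLoader.loadDocument(path, parser);
        log.info("textSegment:{}", document.toTextSegment());
        log.info("meta data:{}", document.metadata().toMap());
        log.info("text:{}", document.text());
    }
}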

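With the file path specified, ApacheTikaDocumentParser parses the file and returns a uniform Document object. The Document can be converted to a TextSegment, which can then be embedded and stored in a vector database: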

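            // embeddingStore is assumed to be an existing EmbeddingStore<TextSegment>
            EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
            // Convert the parsed document into a single TextSegment
            TextSegment segment1 = document.toTextSegment();
            // Embed the segment, then store the embedding together with the segment
            Embedding embedding1 = embeddingModel.embed(segment1).content();
            embeddingStore.add(embedding1, segment1);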

Source code

DocumentParser

dev/langchain4j/data/document/DocumentParser.java

public interface DocumentParser {

    /**
     * Parses a given {@link InputStream} into a {@link Document}.
     * The specific implementation of this method will depend on the type of the document being parsed.
     * <p>
     * Note: This method does not close the provided {@link InputStream} - it is the
     * caller's responsibility to manage the lifecycle of the stream.
     *
     * @param inputStream The {@link InputStream} that contains the content of the {@link Document}.
     * @return The parsed {@link Document}.
     * @throws BlankDocumentException when the parsed {@link Document} is blank/empty.
     */
    Document parse(InputStream inputStream);
}

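DocumentParser defines a single parse method that takes an InputStream and returns a Document. The parser does not close the stream; the caller is responsible for that.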
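
The Javadoc note on parse, that the caller owns the stream's lifecycle, can be sketched with the JDK alone. This is only an illustration: TextParser, TrackingStream and parseAndClose are hypothetical stand-ins, not langchain4j types. With the real API the same try-with-resources pattern would wrap the call to DocumentParser.parse.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamLifecycleSketch {

    /** Hypothetical stand-in for DocumentParser: reads the stream but never closes it. */
    interface TextParser {
        String parse(InputStream inputStream) throws IOException;
    }

    /** Stand-in parser that simply decodes the whole stream as UTF-8 text. */
    static final TextParser READ_ALL =
            in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

    /** In-memory stream that records whether close() was called (demo only). */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** The caller hands the stream to the parser and closes it afterwards. */
    static String parseAndClose(TextParser parser, InputStream stream) {
        try (InputStream in = stream) {
            return parser.parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream("hello tika");
        String text = parseAndClose(READ_ALL, stream);
        // prints "hello tika / closed by caller: true"
        System.out.println(text + " / closed by caller: " + stream.closed);
    }
}
```

FileSystemDocumentLoader.loadDocument, used in the example earlier, takes a path rather than a stream, so this concern only arises when calling parse directly.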

ApacheTikaDocumentParser

dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java

public class ApacheTikaDocumentParser implements DocumentParser {

    private static final int NO_WRITE_LIMIT = -1;
    public static final Supplier<Parser> DEFAULT_PARSER_SUPPLIER = AutoDetectParser::new;
    public static final Supplier<Metadata> DEFAULT_METADATA_SUPPLIER = Metadata::new;
    public static final Supplier<ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER = ParseContext::new;
    public static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER =
            () -> new BodyContentHandler(NO_WRITE_LIMIT);

    private final Supplier<Parser> parserSupplier;
    private final Supplier<ContentHandler> contentHandlerSupplier;
    private final Supplier<Metadata> metadataSupplier;
    private final Supplier<ParseContext> parseContextSupplier;

    private final boolean includeMetadata;

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     * Note: By default, no metadata is added to the parsed document.
     */
    public ApacheTikaDocumentParser() {
        this(false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     *
     * @param includeMetadata        Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(boolean includeMetadata) {
        this(null, null, null, null, includeMetadata);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided Tika components.
     * If some of the components are not provided ({@code null}, the defaults will be used.
     *
     * @param parser         Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandler Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadata       Tika metadata. Default: empty {@link Metadata}
     * @param parseContext   Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Parser parser, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) {
        this(
                () -> getOrDefault(parser, DEFAULT_PARSER_SUPPLIER),
                () -> getOrDefault(contentHandler, DEFAULT_CONTENT_HANDLER_SUPPLIER),
                () -> getOrDefault(metadata, DEFAULT_METADATA_SUPPLIER),
                () -> getOrDefault(parseContext, DEFAULT_PARSE_CONTEXT_SUPPLIER),
                false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier         Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier       Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier   Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files
     * and specify whether to include metadata or not.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier) {
        this(parserSupplier, contentHandlerSupplier, metadataSupplier, parseContextSupplier, false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier         Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier       Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier   Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @param includeMetadata        Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier,
            boolean includeMetadata) {
        this.parserSupplier = getOrDefault(parserSupplier, () -> DEFAULT_PARSER_SUPPLIER);
        this.contentHandlerSupplier = getOrDefault(contentHandlerSupplier, () -> DEFAULT_CONTENT_HANDLER_SUPPLIER);
        this.metadataSupplier = getOrDefault(metadataSupplier, () -> DEFAULT_METADATA_SUPPLIER);
        this.parseContextSupplier = getOrDefault(parseContextSupplier, () -> DEFAULT_PARSE_CONTEXT_SUPPLIER);
        this.includeMetadata = includeMetadata;
    }

    @Override
    public Document parse(InputStream inputStream) {
        try {
            Parser parser = parserSupplier.get();
            ContentHandler contentHandler = contentHandlerSupplier.get();
            Metadata metadata = metadataSupplier.get();
            ParseContext parseContext = parseContextSupplier.get();

            parser.parse(inputStream, contentHandler, metadata, parseContext);
            String text = contentHandler.toString();

            if (isNullOrBlank(text)) {
                throw new BlankDocumentException();
            }

            return includeMetadata ? Document.from(text, convert(metadata)) : Document.from(text);
        } catch (BlankDocumentException e) {
            throw e;
        } catch (ZeroByteFileException e) {
            throw new BlankDocumentException();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Converts a Tika {@link Metadata} object into a {@link dev.langchain4j.data.document.Metadata} object.
     *
     *
     * @param tikaMetadata the {@code Metadata} object from the Tika library containing metadata information
     * @return a {@link dev.langchain4j.data.document.Metadata} object representing in langchain4j format.
     */
    private dev.langchain4j.data.document.Metadata convert(Metadata tikaMetadata) {

        final Map<String, String> tikaMetaData = new HashMap<>();

        for (String name : tikaMetadata.names()) {
            tikaMetaData.put(name, String.join(";", tikaMetadata.getValues(name)));
        }

        return new dev.langchain4j.data.document.Metadata(tikaMetaData);
    }
}
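
The convert method at the end of the listing flattens Tika's multi-valued metadata into one string per name, joined with ";". Below is a small JDK-only sketch of that flattening. The input map and its values are made up for illustration, and a LinkedHashMap is used instead of the HashMap in the listing only to keep the printed order stable.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataFlattenSketch {

    /** Joins each multi-valued entry into a single ';'-separated string. */
    static Map<String, String> flatten(Map<String, List<String>> multiValued) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : multiValued.entrySet()) {
            flat.put(entry.getKey(), String.join(";", entry.getValue()));
        }
        return flat;
    }

    public static void main(String[] args) {
        // Hypothetical metadata values, not taken from a real file
        Map<String, List<String>> raw = new LinkedHashMap<>();
        raw.put("Content-Type", List.of("text/plain"));
        raw.put("dc:creator", List.of("alice", "bob"));

        // prints {Content-Type=text/plain, dc:creator=alice;bob}
        System.out.println(flatten(raw));
    }
}
```

One consequence of this design is that a value which itself contains ";" can no longer be told apart from a separator once it has been flattened.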

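ApacheTikaDocumentParser implements the DocumentParser interface. Its parse method obtains the parser from parserSupplier, the contentHandler from contentHandlerSupplier, the metadata from metadataSupplier and the parseContext from parseContextSupplier. It then calls parser.parse(inputStream, contentHandler, metadata, parseContext) and reads the text via contentHandler.toString(). If the text is blank, or Tika throws ZeroByteFileException, it throws BlankDocumentException. Metadata is attached to the Document only when includeMetadata is true. By default the parser is AutoDetectParser, the contentHandler is a BodyContentHandler with no write limit, the metadata is an empty Metadata and the parseContext is an empty ParseContext.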
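
The deprecation notes in the listing recommend the supplier-based constructors when one parser instance handles multiple files. A likely reason, stated here as an assumption rather than taken from the langchain4j docs, is that a content handler accumulates text, so reusing one instance would mix the output of different files. The sketch below shows the effect with JDK-only stand-ins: CollectingHandler and MiniParser are hypothetical types, not Tika or langchain4j classes.

```java
import java.util.function.Supplier;

public class SupplierPerParseSketch {

    /** Hypothetical stand-in for a stateful content handler that accumulates text. */
    static class CollectingHandler {
        private final StringBuilder buffer = new StringBuilder();

        void characters(String chunk) {
            buffer.append(chunk);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Stand-in parser: asks the supplier for a handler on every parse call. */
    static class MiniParser {
        private final Supplier<CollectingHandler> handlerSupplier;

        MiniParser(Supplier<CollectingHandler> handlerSupplier) {
            // Fall back to a supplier of fresh handlers when none is given
            this.handlerSupplier = handlerSupplier != null ? handlerSupplier : CollectingHandler::new;
        }

        String parse(String content) {
            CollectingHandler handler = handlerSupplier.get();
            handler.characters(content);
            return handler.toString();
        }
    }

    public static void main(String[] args) {
        // A supplier that creates a new handler per call keeps results independent
        MiniParser fresh = new MiniParser(CollectingHandler::new);
        System.out.println(fresh.parse("first"));   // prints "first"
        System.out.println(fresh.parse("second"));  // prints "second"

        // A supplier that always returns the same instance leaks text across calls
        CollectingHandler shared = new CollectingHandler();
        MiniParser reused = new MiniParser(() -> shared);
        System.out.println(reused.parse("first"));  // prints "first"
        System.out.println(reused.parse("second")); // prints "firstsecond"
    }
}
```

The deprecated constructor that takes component instances wraps each instance in a supplier that keeps returning it, which corresponds to the second case above.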

Summary

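langchain4j provides the langchain4j-document-parser-apache-tika module, which reads office documents and parses them into a Document, with Tika's AutoDetectParser detecting the file type. The Document can be converted to a TextSegment, which can then be embedded and stored in a vector database.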

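Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com for removal.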
