A First Try at langchain4j + Tika

Original

code4it

Published on 2025-03-07 09:18:41

Included in the column: 码匠的流水账

Preface

This article looks at how to use langchain4j together with Apache Tika to parse documents.

Steps

pom.xml

        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-document-parser-apache-tika</artifactId>
            <version>1.0.0-beta1</version>
        </dependency>

example

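import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;
import lombok.extern.slf4j.Slf4j;

// Assumption: `log` comes from Lombok's @Slf4j; the original snippet does not show its declaration.
@Slf4j
public class TikaTest {

    public static void main(String[] args) {
        String path = System.getProperty("user.home") + "/downloads/tmp.xlsx";
        // Tika detects the file type from the content, so no format-specific parser is chosen here.
        DocumentParser parser = new ApacheTikaDocumentParser();
        Document document = FileSystemDocumentLoader.loadDocument(path, parser);
        log.info("textSegment:{}", document.toTextSegment());
        log.info("meta data:{}", document.metadata().toMap());
        log.info("text:{}", document.text());
    }
}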

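Once the file path is set, the file is parsed with ApacheTikaDocumentParser, and the result always comes back as a Document object. A Document can be converted to a TextSegment, which can then be embedded and stored in a vector database: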

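        // Assumption: embeddingStore is an EmbeddingStore<TextSegment> created elsewhere (not shown in the original).
        EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
        TextSegment segment1 = document.toTextSegment();
        Embedding embedding1 = embeddingModel.embed(segment1).content();
        embeddingStore.add(embedding1, segment1);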
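
To make the "store the embedding together with its segment" step more concrete, here is a toy sketch in plain Java of what a vector store does with such pairs. It is not the langchain4j EmbeddingStore API: the class ToyVectorStoreSketch, its methods, and the two-dimensional hand-written vectors are all made up for illustration, and a real store would hold the vectors produced by the embedding model.

```java
import java.util.ArrayList;
import java.util.List;

class ToyVectorStoreSketch {

    // One stored entry: an embedding vector plus the text it was computed from.
    static final class Entry {
        final float[] vector;
        final String text;

        Entry(float[] vector, String text) {
            this.vector = vector;
            this.text = text;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Hypothetical counterpart of "add(embedding, segment)".
    void add(float[] vector, String text) {
        entries.add(new Entry(vector, text));
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the text of the stored entry most similar to the query, or null if the store is empty.
    String findMostSimilar(float[] query) {
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry e : entries) {
            double score = cosine(query, e.vector);
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        return best == null ? null : best.text;
    }

    public static void main(String[] args) {
        ToyVectorStoreSketch store = new ToyVectorStoreSketch();
        store.add(new float[] {1f, 0f}, "spreadsheet about sales");
        store.add(new float[] {0f, 1f}, "memo about holidays");
        // The query vector points mostly along the first axis, so the first entry wins.
        System.out.println(store.findMostSimilar(new float[] {0.9f, 0.1f}));
    }
}
```

The point of keeping the text next to the vector is that a similarity search can return the original segment, which is why the example above passes both embedding1 and segment1 to the store.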

Source code

DocumentParser

dev/langchain4j/data/document/DocumentParser.java

public interface DocumentParser {

    /**
     * Parses a given {@link InputStream} into a {@link Document}.
     * The specific implementation of this method will depend on the type of the document being parsed.
     * <p>
     * Note: This method does not close the provided {@link InputStream} - it is the
     * caller's responsibility to manage the lifecycle of the stream.
     *
     * @param inputStream The {@link InputStream} that contains the content of the {@link Document}.
     * @return The parsed {@link Document}.
     * @throws BlankDocumentException when the parsed {@link Document} is blank/empty.
     */
    Document parse(InputStream inputStream);
}

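DocumentParser defines a single parse method that turns an InputStream into a Document. As the Javadoc notes, it does not close the stream; that is the caller's job.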
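
The contract is small enough to sketch with the JDK alone. The example below is a minimal illustration, not langchain4j code: SimpleDocument and SimpleParser are hypothetical stand-ins for Document and DocumentParser, and IllegalStateException stands in for BlankDocumentException. It shows the two rules from the Javadoc: blank content is rejected, and the caller owns the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PlainTextParserSketch {

    // Hypothetical stand-in for langchain4j's Document: it only carries the extracted text.
    static final class SimpleDocument {
        private final String text;

        SimpleDocument(String text) {
            this.text = text;
        }

        String text() {
            return text;
        }
    }

    // Hypothetical stand-in for the DocumentParser interface.
    interface SimpleParser {
        SimpleDocument parse(InputStream inputStream);
    }

    // Reads the whole stream as UTF-8 and rejects blank content.
    // It deliberately does not close the stream.
    static final SimpleParser UTF8_TEXT_PARSER = inputStream -> {
        try {
            String text = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
            if (text.isBlank()) {
                throw new IllegalStateException("document is blank");
            }
            return new SimpleDocument(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };

    public static void main(String[] args) throws IOException {
        byte[] bytes = "hello tika".getBytes(StandardCharsets.UTF_8);
        // The caller owns the stream: try-with-resources closes it after parsing.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            System.out.println(UTF8_TEXT_PARSER.parse(in).text());
        }
    }
}
```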

ApacheTikaDocumentParser

dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java

public class ApacheTikaDocumentParser implements DocumentParser {

    private static final int NO_WRITE_LIMIT = -1;
    public static final Supplier<Parser> DEFAULT_PARSER_SUPPLIER = AutoDetectParser::new;
    public static final Supplier<Metadata> DEFAULT_METADATA_SUPPLIER = Metadata::new;
    public static final Supplier<ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER = ParseContext::new;
    public static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER =
            () -> new BodyContentHandler(NO_WRITE_LIMIT);

    private final Supplier<Parser> parserSupplier;
    private final Supplier<ContentHandler> contentHandlerSupplier;
    private final Supplier<Metadata> metadataSupplier;
    private final Supplier<ParseContext> parseContextSupplier;

    private final boolean includeMetadata;

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     * Note: By default, no metadata is added to the parsed document.
     */
    public ApacheTikaDocumentParser() {
        this(false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     *
     * @param includeMetadata        Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(boolean includeMetadata) {
        this(null, null, null, null, includeMetadata);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided Tika components.
     * If some of the components are not provided ({@code null}, the defaults will be used.
     *
     * @param parser         Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandler Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadata       Tika metadata. Default: empty {@link Metadata}
     * @param parseContext   Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Parser parser, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) {
        this(
                () -> getOrDefault(parser, DEFAULT_PARSER_SUPPLIER),
                () -> getOrDefault(contentHandler, DEFAULT_CONTENT_HANDLER_SUPPLIER),
                () -> getOrDefault(metadata, DEFAULT_METADATA_SUPPLIER),
                () -> getOrDefault(parseContext, DEFAULT_PARSE_CONTEXT_SUPPLIER),
                false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier         Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier       Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier   Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files
     * and specify whether to include metadata or not.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier) {
        this(parserSupplier, contentHandlerSupplier, metadataSupplier, parseContextSupplier, false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier         Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier       Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier   Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @param includeMetadata        Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier,
            boolean includeMetadata) {
        this.parserSupplier = getOrDefault(parserSupplier, () -> DEFAULT_PARSER_SUPPLIER);
        this.contentHandlerSupplier = getOrDefault(contentHandlerSupplier, () -> DEFAULT_CONTENT_HANDLER_SUPPLIER);
        this.metadataSupplier = getOrDefault(metadataSupplier, () -> DEFAULT_METADATA_SUPPLIER);
        this.parseContextSupplier = getOrDefault(parseContextSupplier, () -> DEFAULT_PARSE_CONTEXT_SUPPLIER);
        this.includeMetadata = includeMetadata;
    }

    @Override
    public Document parse(InputStream inputStream) {
        try {
            Parser parser = parserSupplier.get();
            ContentHandler contentHandler = contentHandlerSupplier.get();
            Metadata metadata = metadataSupplier.get();
            ParseContext parseContext = parseContextSupplier.get();

            parser.parse(inputStream, contentHandler, metadata, parseContext);
            String text = contentHandler.toString();

            if (isNullOrBlank(text)) {
                throw new BlankDocumentException();
            }

            return includeMetadata ? Document.from(text, convert(metadata)) : Document.from(text);
        } catch (BlankDocumentException e) {
            throw e;
        } catch (ZeroByteFileException e) {
            throw new BlankDocumentException();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Converts a Tika {@link Metadata} object into a {@link dev.langchain4j.data.document.Metadata} object.
     *
     *
     * @param tikaMetadata the {@code Metadata} object from the Tika library containing metadata information
     * @return a {@link dev.langchain4j.data.document.Metadata} object representing in langchain4j format.
     */
    private dev.langchain4j.data.document.Metadata convert(Metadata tikaMetadata) {

        final Map<String, String> tikaMetaData = new HashMap<>();

        for (String name : tikaMetadata.names()) {
            tikaMetaData.put(name, String.join(";", tikaMetadata.getValues(name)));
        }

        return new dev.langchain4j.data.document.Metadata(tikaMetaData);
    }
}

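ApacheTikaDocumentParser implements the DocumentParser interface. Its parse method obtains a parser from parserSupplier, a contentHandler from contentHandlerSupplier, a metadata object from metadataSupplier and a parseContext from parseContextSupplier. It then calls parser.parse(inputStream, contentHandler, metadata, parseContext) and reads the extracted text from contentHandler.toString(). The default parser is AutoDetectParser, the default contentHandler is a BodyContentHandler without a write limit, the default metadata is an empty Metadata, and the default parseContext is an empty ParseContext.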
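
The deprecation notes in the source recommend the supplier-based constructor when one parser instance handles multiple files. A plausible reason is that a content handler accumulates the text written to it, so reusing one instance would carry text over from the previous file. The sketch below illustrates that effect with the JDK only. StringBuilder stands in for the stateful handler, and FreshHandlerSketch and parseWith are hypothetical names, not part of langchain4j or Tika.

```java
import java.util.function.Supplier;

class FreshHandlerSketch {

    // Stand-in for one parse call: fetch a handler, write the content, read the text back.
    static String parseWith(Supplier<StringBuilder> handlerSupplier, String content) {
        StringBuilder handler = handlerSupplier.get(); // one lookup per parse call
        handler.append(content);
        return handler.toString();
    }

    public static void main(String[] args) {
        // Supplier that always returns the same instance: text from the first call leaks into the second.
        StringBuilder shared = new StringBuilder();
        Supplier<StringBuilder> reused = () -> shared;
        parseWith(reused, "doc-1 ");
        System.out.println(parseWith(reused, "doc-2")); // prints "doc-1 doc-2"

        // Constructor reference: every call starts from an empty handler.
        Supplier<StringBuilder> fresh = StringBuilder::new;
        parseWith(fresh, "doc-1 ");
        System.out.println(parseWith(fresh, "doc-2")); // prints "doc-2"
    }
}
```

This mirrors the defaults in the source, which are all constructor references (AutoDetectParser::new, Metadata::new, ParseContext::new) or a lambda that builds a new BodyContentHandler on each call.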

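Summary

langchain4j provides the langchain4j-document-parser-apache-tika module, which reads office documents using Tika's automatic format detection and parses them into a Document. A Document can be converted to a TextSegment, which can then be embedded and stored in a vector database.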

doc

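Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.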
