This article looks at how langchain4j works with Apache Tika to parse documents.
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-document-parser-apache-tika</artifactId>
    <version>1.0.0-beta1</version>
</dependency>
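import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TikaTest {
    // Assumption: SLF4J is on the classpath; the original snippet used an undeclared `log`
    private static final Logger log = LoggerFactory.getLogger(TikaTest.class);
    public static void main(String[] args) {
        String path = System.getProperty("user.home") + "/downloads/tmp.xlsx";
        DocumentParser parser = new ApacheTikaDocumentParser();
        Document document = FileSystemDocumentLoader.loadDocument(path, parser);
        log.info("textSegment:{}", document.toTextSegment());
        log.info("meta data:{}", document.metadata().toMap());
        log.info("text:{}", document.text());
    }
}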
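With the file path set, ApacheTikaDocumentParser parses the file and returns a Document object. Calling toTextSegment() on that Document yields a TextSegment, which can then be embedded and stored in a vector database: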
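// Assumption: an in-memory store stands in for whatever vector database is actually used
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
TextSegment segment1 = document.toTextSegment();
// Embed the segment, then store the vector together with its source text
Embedding embedding1 = embeddingModel.embed(segment1).content();
embeddingStore.add(embedding1, segment1);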
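To make the idea of storing a segment next to its vector more concrete, here is a minimal sketch that uses only the JDK. It is not the langchain4j API: the `embed` function is a toy word counter over a tiny fixed vocabulary, and `Store` is a hypothetical in-memory stand-in for an embedding store that scans its entries linearly and ranks them by cosine similarity. All names and sample strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ToyEmbeddingStoreSketch {

    // Hypothetical vocabulary; a real embedding model learns its own dense representation.
    static final String[] VOCABULARY = {"invoice", "total", "meeting", "agenda"};

    /** Toy "embedding": counts how often each vocabulary word occurs in the text. */
    static double[] embed(String text) {
        double[] vector = new double[VOCABULARY.length];
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            for (int i = 0; i < VOCABULARY.length; i++) {
                if (VOCABULARY[i].equals(word)) {
                    vector[i]++;
                }
            }
        }
        return vector;
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Hypothetical in-memory store: keeps (vector, segment) pairs side by side. */
    static class Store {
        private final List<double[]> vectors = new ArrayList<>();
        private final List<String> segments = new ArrayList<>();

        void add(double[] vector, String segment) {
            vectors.add(vector);
            segments.add(segment);
        }

        /** Returns the stored segment whose vector is closest to the query vector. */
        String mostSimilar(double[] query) {
            String best = null;
            double bestScore = -1;
            for (int i = 0; i < vectors.size(); i++) {
                double score = cosine(query, vectors.get(i));
                if (score > bestScore) {
                    bestScore = score;
                    best = segments.get(i);
                }
            }
            return best;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        String sheet = "Invoice 42, total 100";
        String notes = "Meeting agenda for Monday";
        store.add(embed(sheet), sheet);
        store.add(embed(notes), notes);
        // prints "Invoice 42, total 100"
        System.out.println(store.mostSimilar(embed("what is the invoice total")));
    }
}
```

The text of a parsed document plays the role of `sheet` and `notes` here: once it is stored with its vector, a later query can retrieve it by similarity rather than by exact match.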
dev/langchain4j/data/document/DocumentParser.java
public interface DocumentParser {

    /**
     * Parses a given {@link InputStream} into a {@link Document}.
     * The specific implementation of this method will depend on the type of the document being parsed.
     * <p>
     * Note: This method does not close the provided {@link InputStream} - it is the
     * caller's responsibility to manage the lifecycle of the stream.
     *
     * @param inputStream The {@link InputStream} that contains the content of the {@link Document}.
     * @return The parsed {@link Document}.
     * @throws BlankDocumentException when the parsed {@link Document} is blank/empty.
     */
    Document parse(InputStream inputStream);
}
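DocumentParser defines a single parse method that takes an InputStream and returns a Document. As the Javadoc notes, the method does not close the stream, and it throws BlankDocumentException when the parsed content is blank.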
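The two rules in that Javadoc (the parser never closes the stream, and blank content is rejected) can be illustrated with a small JDK-only sketch. `TextParser`, `BlankInputException`, `TrackingStream` and `parseAndClose` are hypothetical stand-ins invented for this example; they only mirror the shape of the contract and are not langchain4j types.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ParserContractSketch {

    /** Hypothetical stand-in for BlankDocumentException. */
    static class BlankInputException extends RuntimeException {
    }

    /** Hypothetical stand-in for DocumentParser: reads the stream but never closes it. */
    interface TextParser {
        String parse(InputStream inputStream);
    }

    /** Toy implementation that treats the bytes as UTF-8 text and rejects blank input. */
    static final TextParser PLAIN_TEXT = inputStream -> {
        try {
            String text = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
            if (text.trim().isEmpty()) {
                throw new BlankInputException();
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };

    /** In-memory stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** The caller owns the stream, so it wraps the parse call in try-with-resources. */
    static String parseAndClose(TextParser parser, InputStream in) {
        try (InputStream stream = in) {
            return parser.parse(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream in = new TrackingStream("hello tika");
        System.out.println(PLAIN_TEXT.parse(in));                // hello tika
        System.out.println("closed by parser: " + in.closed);    // closed by parser: false

        TrackingStream owned = new TrackingStream("hello again");
        parseAndClose(PLAIN_TEXT, owned);
        System.out.println("closed by caller: " + owned.closed); // closed by caller: true
    }
}
```

When calling a parser directly instead of going through a loader, the same try-with-resources pattern around the parse call keeps the stream from leaking.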
dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java
public class ApacheTikaDocumentParser implements DocumentParser {

    private static final int NO_WRITE_LIMIT = -1;
    public static final Supplier<Parser> DEFAULT_PARSER_SUPPLIER = AutoDetectParser::new;
    public static final Supplier<Metadata> DEFAULT_METADATA_SUPPLIER = Metadata::new;
    public static final Supplier<ParseContext> DEFAULT_PARSE_CONTEXT_SUPPLIER = ParseContext::new;
    public static final Supplier<ContentHandler> DEFAULT_CONTENT_HANDLER_SUPPLIER =
            () -> new BodyContentHandler(NO_WRITE_LIMIT);

    private final Supplier<Parser> parserSupplier;
    private final Supplier<ContentHandler> contentHandlerSupplier;
    private final Supplier<Metadata> metadataSupplier;
    private final Supplier<ParseContext> parseContextSupplier;
    private final boolean includeMetadata;

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     * Note: By default, no metadata is added to the parsed document.
     */
    public ApacheTikaDocumentParser() {
        this(false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the default Tika components.
     * It uses {@link AutoDetectParser}, {@link BodyContentHandler} without write limit,
     * empty {@link Metadata} and empty {@link ParseContext}.
     *
     * @param includeMetadata Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(boolean includeMetadata) {
        this(null, null, null, null, includeMetadata);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided Tika components.
     * If some of the components are not provided ({@code null}, the defaults will be used.
     *
     * @param parser Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandler Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadata Tika metadata. Default: empty {@link Metadata}
     * @param parseContext Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Parser parser, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) {
        this(
                () -> getOrDefault(parser, DEFAULT_PARSER_SUPPLIER),
                () -> getOrDefault(contentHandler, DEFAULT_CONTENT_HANDLER_SUPPLIER),
                () -> getOrDefault(metadata, DEFAULT_METADATA_SUPPLIER),
                () -> getOrDefault(parseContext, DEFAULT_PARSE_CONTEXT_SUPPLIER),
                false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @deprecated Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files
     * and specify whether to include metadata or not.
     */
    @Deprecated(forRemoval = true)
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier) {
        this(parserSupplier, contentHandlerSupplier, metadataSupplier, parseContextSupplier, false);
    }

    /**
     * Creates an instance of an {@code ApacheTikaDocumentParser} with the provided suppliers for Tika components.
     * If some of the suppliers are not provided ({@code null}), the defaults will be used.
     *
     * @param parserSupplier Supplier for Tika parser to use. Default: {@link AutoDetectParser}
     * @param contentHandlerSupplier Supplier for Tika content handler. Default: {@link BodyContentHandler} without write limit
     * @param metadataSupplier Supplier for Tika metadata. Default: empty {@link Metadata}
     * @param parseContextSupplier Supplier for Tika parse context. Default: empty {@link ParseContext}
     * @param includeMetadata Whether to include metadata in the parsed document
     */
    public ApacheTikaDocumentParser(
            Supplier<Parser> parserSupplier,
            Supplier<ContentHandler> contentHandlerSupplier,
            Supplier<Metadata> metadataSupplier,
            Supplier<ParseContext> parseContextSupplier,
            boolean includeMetadata) {
        this.parserSupplier = getOrDefault(parserSupplier, () -> DEFAULT_PARSER_SUPPLIER);
        this.contentHandlerSupplier = getOrDefault(contentHandlerSupplier, () -> DEFAULT_CONTENT_HANDLER_SUPPLIER);
        this.metadataSupplier = getOrDefault(metadataSupplier, () -> DEFAULT_METADATA_SUPPLIER);
        this.parseContextSupplier = getOrDefault(parseContextSupplier, () -> DEFAULT_PARSE_CONTEXT_SUPPLIER);
        this.includeMetadata = includeMetadata;
    }

    @Override
    public Document parse(InputStream inputStream) {
        try {
            Parser parser = parserSupplier.get();
            ContentHandler contentHandler = contentHandlerSupplier.get();
            Metadata metadata = metadataSupplier.get();
            ParseContext parseContext = parseContextSupplier.get();
            parser.parse(inputStream, contentHandler, metadata, parseContext);
            String text = contentHandler.toString();
            if (isNullOrBlank(text)) {
                throw new BlankDocumentException();
            }
            return includeMetadata ? Document.from(text, convert(metadata)) : Document.from(text);
        } catch (BlankDocumentException e) {
            throw e;
        } catch (ZeroByteFileException e) {
            throw new BlankDocumentException();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Converts a Tika {@link Metadata} object into a {@link dev.langchain4j.data.document.Metadata} object.
     *
     *
     * @param tikaMetadata the {@code Metadata} object from the Tika library containing metadata information
     * @return a {@link dev.langchain4j.data.document.Metadata} object representing in langchain4j format.
     */
    private dev.langchain4j.data.document.Metadata convert(Metadata tikaMetadata) {
        final Map<String, String> tikaMetaData = new HashMap<>();
        for (String name : tikaMetadata.names()) {
            tikaMetaData.put(name, String.join(";", tikaMetadata.getValues(name)));
        }
        return new dev.langchain4j.data.document.Metadata(tikaMetaData);
    }
}
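The convert method at the end of the class flattens Tika's multi-valued metadata into one string per key by joining the values with ";". The sketch below reproduces just that step with plain JDK collections, so it can be run without Tika. The class name, the `flatten` helper and the sample keys and values are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataJoinSketch {

    /** Flattens multi-valued entries into single strings, joining the values with ';'. */
    static Map<String, String> flatten(Map<String, String[]> multiValued) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : multiValued.entrySet()) {
            flat.put(entry.getKey(), String.join(";", entry.getValue()));
        }
        return flat;
    }

    public static void main(String[] args) {
        // Sample keys and values are made up; real names depend on the parsed file type.
        Map<String, String[]> source = new LinkedHashMap<>();
        source.put("Content-Type", new String[] {"application/pdf"});
        source.put("dc:creator", new String[] {"Alice", "Bob"});
        // prints {Content-Type=application/pdf, dc:creator=Alice;Bob}
        System.out.println(flatten(source));
    }
}
```

One consequence of this design is that a value which itself contains ";" cannot be split back into its original parts unambiguously, so the joined string is best treated as display or filter text.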
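ApacheTikaDocumentParser implements the DocumentParser interface. Its parse method obtains a parser from parserSupplier, a contentHandler from contentHandlerSupplier, a metadata object from metadataSupplier, and a parseContext from parseContextSupplier. It then calls parser.parse(inputStream, contentHandler, metadata, parseContext) and reads the extracted text from contentHandler.toString(). If the text is blank it throws BlankDocumentException, and the Tika metadata is attached to the returned Document only when includeMetadata is true. The defaults are AutoDetectParser for the parser, BodyContentHandler (with no write limit) for the contentHandler, an empty Metadata, and an empty ParseContext.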
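The parse method asks each supplier for an object on every call, and the Javadoc of the deprecated instance-based constructor recommends suppliers when one parser handles multiple files. A plausible reason is that components such as a content handler accumulate state while parsing. The JDK-only sketch below shows the effect with a hypothetical `TextCollector` and `ToyParser`; neither is a Tika or langchain4j type.

```java
import java.util.function.Supplier;

public class SupplierPerParseSketch {

    /** Hypothetical stand-in for a stateful content handler that accumulates text. */
    static class TextCollector {
        private final StringBuilder buffer = new StringBuilder();

        void characters(String chunk) {
            buffer.append(chunk);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /** Hypothetical parser that asks the supplier for a handler on every parse call. */
    static class ToyParser {
        private final Supplier<TextCollector> handlerSupplier;

        ToyParser(Supplier<TextCollector> handlerSupplier) {
            this.handlerSupplier = handlerSupplier;
        }

        String parse(String content) {
            TextCollector handler = handlerSupplier.get();
            handler.characters(content);
            return handler.toString();
        }
    }

    public static void main(String[] args) {
        TextCollector shared = new TextCollector();
        ToyParser reusing = new ToyParser(() -> shared);      // same handler instance every time
        ToyParser fresh = new ToyParser(TextCollector::new);  // new handler instance every time

        reusing.parse("first");
        System.out.println(reusing.parse("second")); // firstsecond (text from the first call leaks in)

        fresh.parse("first");
        System.out.println(fresh.parse("second"));   // second
    }
}
```

Passing a constructor reference such as `TextCollector::new` is what keeps each parse call isolated, which matches the shape of the default suppliers declared at the top of ApacheTikaDocumentParser.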
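To sum up, langchain4j provides the langchain4j-document-parser-apache-tika module for reading office documents automatically and parsing them into the Document type. A Document can be turned into a TextSegment, which can then be embedded and stored in a vector database.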
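Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com to have it removed.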