A Look at Spring AI Alibaba's BilibiliDocumentReader

code4it
Published 2025-04-18 09:13:45
Included in the column: 码匠的流水账

This post takes a look at Spring AI Alibaba's BilibiliDocumentReader.

BilibiliDocumentReader

community/document-readers/spring-ai-alibaba-starter-document-reader-bilibili/src/main/java/com/alibaba/cloud/ai/reader/bilibili/BilibiliDocumentReader.java

public class BilibiliDocumentReader implements DocumentReader {

	private static final Logger logger = LoggerFactory.getLogger(BilibiliDocumentReader.class);

	private static final String API_BASE_URL = "https://api.bilibili.com/x/web-interface/view?bvid=";

	private final String resourcePath;

	private final ObjectMapper objectMapper;

	private static final int MEMORY_SIZE = 5;

	private static final int BYTE_SIZE = 1024;

	private static final int MAX_MEMORY_SIZE = MEMORY_SIZE * BYTE_SIZE * BYTE_SIZE;

	private static final WebClient WEB_CLIENT = WebClient.builder()
		.defaultHeader(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
		.codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(MAX_MEMORY_SIZE))
		.build();

	public BilibiliDocumentReader(String resourcePath) {
		Assert.hasText(resourcePath, "Query string must not be empty");
		this.resourcePath = resourcePath;
		this.objectMapper = new ObjectMapper();
	}

	@Override
	public List<Document> get() {
		List<Document> documents = new ArrayList<>();
		try {
			String bvid = extractBvid(resourcePath);
			String videoInfoResponse = fetchVideoInfo(bvid);
			JsonNode videoData = parseJson(videoInfoResponse).path("data");
			String title = videoData.path("title").asText();
			String description = videoData.path("desc").asText();
			Document infoDoc = new Document("Video information", Map.of("title", title, "description", description));
			documents.add(infoDoc);
			String documentContent = fetchAndProcessSubtitles(videoData, title, description);
			documents.add(new Document(documentContent));
		}
		catch (IllegalArgumentException e) {
			logger.error("Invalid input: {}", e.getMessage());
			documents.add(new Document("Error: Invalid input"));
		}
		catch (IOException e) {
			logger.error("Error parsing JSON: {}", e.getMessage(), e);
			documents.add(new Document("Error parsing JSON: " + e.getMessage()));
		}
		catch (Exception e) {
			logger.error("Unexpected error: {}", e.getMessage(), e);
			documents.add(new Document("Unexpected error: " + e.getMessage()));
		}
		return documents;
	}

	private String extractBvid(String resourcePath) {
		return resourcePath.replaceAll(".*(BV\\w+).*", "$1");
	}

	private String fetchVideoInfo(String bvid) {
		return WEB_CLIENT.get().uri(API_BASE_URL + bvid).retrieve().bodyToMono(String.class).block();
	}

	private JsonNode parseJson(String jsonResponse) throws IOException {
		return objectMapper.readTree(jsonResponse);
	}

	private String fetchAndProcessSubtitles(JsonNode videoData, String title, String description) throws IOException {
		JsonNode subtitleList = videoData.path("subtitle").path("list");
		if (subtitleList.isArray() && subtitleList.size() > 0) {
			String subtitleUrl = subtitleList.get(0).path("subtitle_url").asText();
			String subtitleResponse = WEB_CLIENT.get().uri(subtitleUrl).retrieve().bodyToMono(String.class).block();

			JsonNode subtitleJson = parseJson(subtitleResponse);
			StringBuilder rawTranscript = new StringBuilder();
			subtitleJson.path("body").forEach(node -> rawTranscript.append(node.path("content").asText()).append(" "));

			return String.format("Video Title: %s, Description: %s\nTranscript: %s", title, description,
					rawTranscript.toString().trim());
		}
		else {
			return String.format("No subtitles found for video: %s. Returning an empty transcript.", resourcePath);
		}
	}

}

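BilibiliDocumentReader uses WebClient to call the Bilibili API. It extracts the bvid from the URL it is given, requests the video info endpoint for that bvid, and parses the JSON response for the title and desc fields. get() returns two documents: the first has the fixed text "Video information" with the title and description as metadata; the second holds the result of fetchAndProcessSubtitles, which requests the first subtitle_url in the subtitle list and joins the subtitle entries into a transcript. If the video has no subtitles, or a request or parsing step fails, the message is returned as a document instead of being thrown.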
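The bvid extraction step can be tried in isolation, without any network access. The sketch below reuses the replaceAll expression from extractBvid in the listing above and, for comparison, adds a hypothetical extractStrict helper (not part of the library) built on Pattern and Matcher. It makes one behavior visible: when the input contains no BV id, replaceAll returns the input unchanged, so the whole string would be appended to the API URL as the bvid.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BvidExtractor {

    // Hypothetical pattern for the stricter variant; "BV" followed by word characters.
    private static final Pattern BVID = Pattern.compile("BV\\w+");

    // Same expression as extractBvid in the reader. Because the leading .* is greedy,
    // this keeps the LAST "BV..." run in the input. With no match, the input comes back as-is.
    static String extractLikeReader(String resourcePath) {
        return resourcePath.replaceAll(".*(BV\\w+).*", "$1");
    }

    // Assumption for illustration: return the FIRST "BV..." run, or empty when there is none.
    static Optional<String> extractStrict(String resourcePath) {
        Matcher matcher = BVID.matcher(resourcePath);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "https://www.bilibili.com/video/BV1KMwgeKECx/?t=7";
        System.out.println(extractLikeReader(url));               // BV1KMwgeKECx
        System.out.println(extractLikeReader("not a video url")); // not a video url
        System.out.println(extractStrict("not a video url").isPresent()); // false
    }
}
```

Returning an Optional lets a caller reject an input before any HTTP request is made; whether that is preferable to the reader's current behavior is a design choice for the caller.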

Example

public class BilibiliDocumentReaderTest {

	private static final Logger logger = LoggerFactory.getLogger(BilibiliDocumentReader.class);

	@Test
	void bilibiliDocumentReaderTest() {
		BilibiliDocumentReader bilibiliDocumentReader = new BilibiliDocumentReader(
				"https://www.bilibili.com/video/BV1KMwgeKECx/?t=7&vd_source=3069f51b168ac07a9e3c4ba94ae26af5");
		List<Document> documents = bilibiliDocumentReader.get();
		logger.info("documents: {}", documents);
	}

}
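Because get() catches its exceptions and returns their messages as documents, a caller that logs or indexes the result may want to tell real content from placeholders. The sketch below is one way to do that on the document text alone. The isPlaceholder helper is hypothetical, and how the text is read from a Document depends on the Spring AI version in use, so the example works on plain strings. The prefixes are copied from the reader listing above.

```java
import java.util.List;

final class ReaderOutputCheck {

    // Text prefixes that the reader's get() and fetchAndProcessSubtitles use
    // for error and "no subtitles" documents.
    private static final List<String> PLACEHOLDER_PREFIXES = List.of(
            "Error: Invalid input",
            "Error parsing JSON: ",
            "Unexpected error: ",
            "No subtitles found for video: ");

    // Hypothetical helper: true when the text is a placeholder rather than video content.
    static boolean isPlaceholder(String text) {
        return PLACEHOLDER_PREFIXES.stream().anyMatch(text::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isPlaceholder("Unexpected error: request failed"));       // true
        System.out.println(isPlaceholder("Video Title: t, Description: d\nTranscript: hi")); // false
    }
}
```

Prefix matching is brittle if the library changes its messages, so treat this as a stopgap for inspection and filtering, not as a contract.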

Summary

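spring-ai-alibaba-starter-document-reader-bilibili provides BilibiliDocumentReader for reading a Bilibili video URL. It makes up to two requests: one to the video info endpoint for the title and description, and, when the video has subtitles, one to the subtitle URL for the transcript.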
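How the results of the two requests are combined into the text of the second document can be sketched without any HTTP or JSON handling. The example below mirrors the StringBuilder loop and String.format call in fetchAndProcessSubtitles; the class name, the buildContent method, and the sample values are assumptions for illustration, and the subtitle lines are passed in as a plain list instead of being read from the subtitle JSON.

```java
import java.util.List;

final class TranscriptSketch {

    // Joins subtitle lines with single spaces and prefixes title and description,
    // using the same format string as the reader.
    static String buildContent(String title, String description, List<String> subtitleLines) {
        StringBuilder rawTranscript = new StringBuilder();
        for (String line : subtitleLines) {
            rawTranscript.append(line).append(" ");
        }
        return String.format("Video Title: %s, Description: %s\nTranscript: %s", title, description,
                rawTranscript.toString().trim());
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        System.out.println(buildContent("Demo", "A sample", List.of("hello", "world")));
    }
}
```

The timestamps that accompany each subtitle entry are not part of this text; only the content values are joined.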

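Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com to request removal.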
