
[LangChain Series] Part 9: Evaluating LLM Applications

Original article by Freedom123
Published 2024-05-25 20:40:40
Column: AIGC



As large language models (LLMs) continue to advance, the applications built on them are becoming increasingly complex and sophisticated. With that complexity, evaluating the performance and accuracy of LLM-based applications becomes more challenging as well. In this post, we dive into LLM application evaluation and look at frameworks and tools that can help you assess and improve your application's performance.

I. Creating a QA Application

Code language: python
import os
from dotenv import load_dotenv, find_dotenv
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores.docarray import DocArrayInMemorySearch
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI

# Load the OpenAI API key (and other settings) from a .env file
_ = load_dotenv(find_dotenv())

# Build the path to the product catalog CSV, one level above the current directory
notebook_path = os.path.abspath("__file__")
notebook_directory = os.path.dirname(notebook_path)
csv_file_path = os.path.join(notebook_directory, '..', 'OutdoorClothingCatalog_1000.csv')

# Load the catalog and build an in-memory DocArray vector index over it
loader = CSVLoader(file_path=csv_file_path)
data = loader.load()
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders(
    [loader]
)

# Create a RetrievalQA chain that "stuffs" the retrieved documents into the prompt
llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={"document_separator": "<<<<>>>>>"},
)
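
Before building a formal test set, it can be worth sanity-checking the chain with a single ad-hoc question and inspecting the raw output. A minimal sketch; the question text below is made up for illustration and does not come from the catalog:

Code language: python
# Quick smoke test of the RetrievalQA chain (the query is a hypothetical example)
response = qa.invoke("Do you have any shirts with sun protection?")
print(response["result"])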

II. Building Test Data

Before we can evaluate an LLM application, we need a reliable set of test data. There are two main ways to generate it:

1. Creating Examples by Hand

The traditional approach is to review your data manually and write query-answer pairs yourself. Suppose you are working with a clothing dataset: you can read through the descriptions and write questions such as "Does the Cozy Comfort Pullover Set have side pockets?" along with the corresponding answers. This gives you full control over the examples, but it is time-consuming and does not scale well to larger datasets.

Code language: python
# Hard-coded examples
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set \
        have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

2. Generating Examples with an LLM

You can also use an LLM itself to generate test data. LangChain provides QAGenerateChain, which automatically produces query-answer pairs from your documents; think of it as an AI assistant that invents plausible questions and answers based on your data.

Code language: python
from langchain.evaluation.qa import QAGenerateChain
from pprint import pprint
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
new_examples = example_gen_chain.batch([{"doc": t} for t in data[:5]])
pprint(new_examples[0]["qa_pairs"])
# Output
# {'answer': "The approximate weight of the Women's Campside Oxfords per pair is "
#            '1 lb. 1 oz.',
#  'query': "What is the approximate weight of the Women's Campside Oxfords per "
#           'pair?'}

data[0]
# Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", 
#         metadata={'source': '/home/voldemort/Downloads/Code/Langchain_Harrison_Chase/Course_1/OutdoorClothingCatalog_1000.csv', 'row': 0})

By combining hand-crafted examples with LLM-generated ones, you can quickly build a solid test dataset.

Code language: python
examples.extend([inst["qa_pairs"] for inst in new_examples])
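
After the extend call, examples contains both the two hand-written pairs and the five generated ones, seven in total, which you can confirm before running any evaluation:

Code language: python
# 2 hand-coded examples + 5 LLM-generated examples = 7 in total
print(len(examples))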

III. Manual Evaluation and Debugging

With test data in hand, it is time to evaluate your LLM application's performance. The simplest approach is to run the examples through the application and inspect the final output.

Code language: python
qa.invoke(examples[-1]["query"])
# Output
# Entering new RetrievalQA chain...
# Finished chain.
# {'query': 'What technology is used in the EcoFlex 3L Storm Pants to make them more breathable and keep the wearer dry and comfortable?',
#  'result': 'The EcoFlex 3L Storm Pants use TEK O2 technology to make them more breathable and keep the wearer dry and comfortable.'}

However, this approach is limited: it gives you no insight into the intermediate steps of the application's pipeline or where problems might be arising.

1. Running Examples Through the Application

To get a deeper look at your application's behavior, LangChain provides the langchain.debug flag. When enabled, it prints detailed information at every step of execution, including prompts, context, and intermediate results.

Code language: python
import langchain
langchain.debug = True
qa.invoke(examples[0]["query"])

By examining this output, you can spot potential problems in the retrieval or prompting steps, which makes it much easier to pinpoint and fix issues.

Code language: python
"""
Output:

> Entering new RetrievalQA chain...
> Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
> Entering StuffDocumentsChain run with input:
[inputs]
> Entering LLMChain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\n...
}
[llm/start] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 73\nname: Cozy Cuddles Knit Pullover Set\n...
Human: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1.89s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set does have side pockets.",
        ...
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "completion_tokens": 14,
      "prompt_tokens": 733,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo",
    "system_fingerprint": "fp_3b956da36b"
  },
  "run": null
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [1.89s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
[chain/end] [2.36s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
}
"""

# Final Output:
# {'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
#  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}
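
Once you are done inspecting the trace, turn the debug flag back off so the batch evaluation in the next section does not flood the console with intermediate output:

Code language: python
# Disable verbose step-by-step tracing again
langchain.debug = False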

IV. LLM-Assisted Evaluation

Manual evaluation is valuable, but as the number of examples grows it quickly becomes tedious and subjective. This is where LLM-assisted evaluation comes in.

1. Getting Predictions for the Examples

The first step is to run your examples through the LLM application and collect its predictions.

Code language: python
predictions = qa.batch(inputs=examples)

2. Grading with QAEvalChain

LangChain provides QAEvalChain, an LLM-based chain designed to judge whether your application's predictions are correct. Because it relies on the LLM's understanding of semantic similarity, it can grade accurately even when a prediction does not match the expected answer word for word.

Code language: python
from langchain.evaluation import QAEvalChain

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

# Build an LLM-backed grader and compare each prediction with its reference answer
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

By reviewing the graded outputs, you can quickly identify areas that need improvement and iterate on your LLM application.

Code language: python
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]["query"])
    print("Real Answer: " + predictions[i]["answer"])
    print("Predicted Answer: " + predictions[i]["result"])
    print("Predicted Grade: " + graded_outputs[i]["results"])
    print()

The final output looks something like this:

Code language: text
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28", and the dimensions of the medium size are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: Some key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom, secure fit, and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps and fully lined bottom for a secure fit and coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear, V-Neck Tankini Contrasts is as follows:
- Body: 82% recycled nylon, 18% Lycra® spandex
- Lining: 90% recycled nylon, 10% Lycra® spandex
Predicted Grade: CORRECT

Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is a state-of-the-art air-permeable technology that offers the most breathability tested by the brand.
Predicted Grade: CORRECT
Code language: python
graded_outputs[-1]
# {'results': 'CORRECT'}
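
Beyond reading the per-example grades, you can collapse graded_outputs into a single accuracy number to track across iterations of your application. A small sketch, assuming the grader only ever emits "CORRECT" or "INCORRECT" labels:

Code language: python
# Overall accuracy: fraction of examples the LLM grader marked CORRECT
correct = sum(1 for g in graded_outputs if g["results"].strip() == "CORRECT")
print(f"Accuracy: {correct}/{len(graded_outputs)} = {correct / len(graded_outputs):.0%}")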

Summary

Evaluating an LLM application is a key step in ensuring its reliability and performance. By using tools such as LangChain's QAGenerateChain, the langchain.debug flag, QAEvalChain, and the LangChain evaluation platform, you can streamline the evaluation process, gain deeper insight into your application's behavior, and iterate more efficiently. Whether you are a seasoned machine-learning practitioner or just getting started, these frameworks and tools can help you unlock the full potential of your LLM applications.

The author is a columnist passionate about artificial intelligence, dedicated to sharing the latest knowledge, techniques, and trends in the field. Here you can learn about the newest AI applications and innovations, discuss AI's impact on society, and explore the scientific principles and engineering behind it. Likes, comments, and bookmarks are welcome; let's explore the mysteries of AI together and witness the progress of technology!

Original statement: This article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

If you believe your rights have been infringed, please contact cloudcommunity@tencent.com to request removal.

