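Note: This article is adapted from the GitHub project anthropics/courses/prompt_evaluations. It has passed through translation, so a few expressions may differ slightly from the original, but the main logic is unaffected. It is intended for personal reading and study only.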

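Promptfoo: Classification Evaluations

Note: This lesson lives in a folder that contains the relevant code files. If you want to follow along and run the evaluation yourself, download the entire folder.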

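In an earlier lesson, we evaluated prompts that classify customer complaints such as:

Whenever I open your app, my phone gets really slow

and

I can't figure out how to change my password

into five different categories:

Software Bug
Hardware Malfunction
User Error
Service Outage
Feature Request

In this lesson, we will port that prompt evaluation to promptfoo, which makes it easier to run in batches and compare results.

Initializing promptfoo

The first step is to initialize promptfoo with this command: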

npx promptfoo@latest init

As we saw in the previous lesson, this creates a promptfooconfig.yaml file. We can delete its existing contents.

Next, we'll configure our provider. Add the following to promptfooconfig.yaml:

description: "Complaint Classification Eval"
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

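We'll use Claude 3 Haiku to save on API costs, since we will run this evaluation many times in this lesson.

Make sure the ANTHROPIC_API_KEY environment variable is set. You can set it by running this command in your terminal: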

export ANTHROPIC_API_KEY=your_api_key_here
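
A missing key typically only surfaces once the first API call is made, so a quick check beforehand can save a confusing failure. The following is a minimal sketch in Python; the helper name api_key_is_set is our own and is not part of promptfoo:

```python
import os


def api_key_is_set(env=None):
    """Return True if ANTHROPIC_API_KEY is present and non-empty."""
    # Accept an explicit mapping so the check is easy to test;
    # fall back to the real process environment otherwise.
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if __name__ == "__main__":
    if api_key_is_set():
        print("ANTHROPIC_API_KEY is set")
    else:
        print("ANTHROPIC_API_KEY is missing; export it before running the eval")
```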

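Preparing our prompts

Next, we'll gather our prompts and make sure promptfoo knows about them. We'll follow the same pattern we saw in the previous video:

We write each prompt as a Python function.
Each prompt function returns a prompt string.
All of our prompt functions live in a single prompts.py file.

Create a new file named prompts.py and add the following prompt functions to it. These are the same two prompts we originally wrote in the complaint classification lesson: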

def basic_prompt(complaint):
    return f"""
    Classify the following customer complaint into one or more of these categories: 
    Software Bug, Hardware Malfunction, User Error, Feature Request, or Service Outage.
    Only respond with the classification.

    Complaint: {complaint}

    Classification:
    """

def improved_prompt(complaint):
    return f"""
    You are an AI assistant specializing in customer support issue classification. Your task is to analyze customer complaints and categorize them into one or more of the following categories:

    1. Software Bug: Issues related to software not functioning as intended.
    2. Hardware Malfunction: Problems with physical devices or components.
    3. User Error: Difficulties arising from user misunderstanding or misuse.
    4. Feature Request: Suggestions for new functionalities or improvements.
    5. Service Outage: System-wide issues affecting service availability.

    Important Guidelines:
    - A complaint may fall into multiple categories. If so, list all that apply but try to prioritize picking a single category when possible.

    Examples:
    1. Complaint: "The app crashes when I try to save my progress."
    Classification: Software Bug

    2. Complaint: "My keyboard isn't working after I spilled coffee on it."
    Classification: Hardware Malfunction

    3. Complaint: "I can't find the login button on your website."
    Classification: User Error

    4. Complaint: "It would be great if your app had a dark mode."
    Classification: Feature Request

    5. Complaint: "None of your services are loading for me or my colleagues."
    Classification: Service Outage

    6. Complaint: "The app breaks every time I try to change my profile picture"
    Classification: Software Bug

    7. Complaint: "The app is acting buggy on my phone and it seems like your website is down, so I'm completely stuck!"
    Classification: Software Bug, Service Outage

    8. Complaint: "Your software makes my computer super laggy and awful, I hate it!"
    Classification: Software Bug

    9. Complaint: "Your dumb app always breaks when I try to do anything with images."
    Classification: Software Bug

    Now, please classify the following customer complaint:

    <complaint>{complaint}</complaint>

    Only respond with the appropriate categories and nothing else.
    Classification:
    """

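Because these prompts are triple-quoted f-strings written inside a function body, the returned string keeps the leading indentation of every line. To eyeball what a rendered prompt looks like, a small preview helper can strip that common indentation for reading. This is only a sketch: demo_prompt is a hypothetical stand-in with the same shape as basic_prompt, and preview is our own helper, used for inspection only (the string the prompt function returns is unchanged).

```python
import textwrap


def demo_prompt(complaint):
    # Hypothetical stand-in with the same shape as basic_prompt in prompts.py.
    return f"""
    Classify the following customer complaint.

    Complaint: {complaint}

    Classification:
    """


def preview(prompt_fn, complaint):
    """Render a prompt and strip the common leading indentation for reading."""
    return textwrap.dedent(prompt_fn(complaint)).strip()


print(preview(demo_prompt, "I can't remember my password!"))
```
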
Next, we need to tell promptfoo that we want to use these two prompts. Update the promptfooconfig.yaml file:

description: "Complaint Classification Eval"

prompts:
  - prompts.py:basic_prompt
  - prompts.py:improved_prompt
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

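Preparing our eval dataset

The final step is to adapt our evaluation dataset to a format that promptfoo can read. As a reminder, this is what the original eval_data Python list from the earlier lesson looked like: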

eval_data = [
    {
        "complaint": "The app crashes every time I try to upload a photo",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My printer isn't recognized by my computer",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I can't figure out how to change my password",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The website is completely down, I can't access any pages",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "It would be great if the app had a dark mode option",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The software keeps freezing when I try to save large files",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My wireless mouse isn't working, even with new batteries",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I accidentally deleted some important files, can you help me recover them?",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "None of your servers are responding, is there an outage?",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "Could you add a feature to export data in CSV format?",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The app is crashing and my phone is overheating",
        "golden_answer": ["Software Bug", "Hardware Malfunction"]
    },
    {
        "complaint": "I can't remember my password!",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The new update broke something and the app no longer works for me",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "I think I installed something incorrectly, now my computer won't start at all",
        "golden_answer": ["User Error", "Hardware Malfunction"]
    },
    {
        "complaint": "Your service is down, and I urgently need a feature to batch process files",
        "golden_answer": ["Service Outage", "Feature Request"]
    },
    {
        "complaint": "The graphics card is making weird noises",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "My keyboard just totally stopped working out of nowhere",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "Whenever I open your app, my phone gets really slow",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "Can you make the interface more user-friendly? I always get lost in the menus",
        "golden_answer": ["Feature Request", "User Error"]
    },
    {
        "complaint": "The cloud storage isn't syncing and I can't access my files from other devices",
        "golden_answer": ["Software Bug", "Service Outage"]
    }
]

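As we did in the previous lesson, we'll convert the dataset to a CSV file. The key difference here is that our grading logic is no longer a simple exact match. To grade this evaluation, we want promptfoo to make sure each model output contains the correct classification or classifications.

For example, given this row of the dataset: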

{
    "complaint": "The cloud storage isn't syncing and I can't access my files from other devices",
    "golden_answer": ["Software Bug", "Service Outage"]
}

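our prompt takes this complaint as input:

The cloud storage isn't syncing and I can't access my files from other devices

For the example above, we need promptfoo to make sure the model's response contains both "Software Bug" and "Service Outage". We can't rely on an exact match: what if the model outputs the two classifications in the opposite order?
Thankfully, promptfoo provides a set of built-in assertions we can use. These include: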

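contains - output contains the substring
contains-all - output contains every substring in a list
contains-any - output contains at least one of the listed substrings
contains-json - output contains valid JSON (optional JSON schema validation)
contains-sql - output contains valid SQL
contains-xml - output contains valid XML
equals - output matches exactly
icontains - output contains the substring, case insensitive
icontains-all - output contains every substring in a list, case insensitive
icontains-any - output contains at least one of the listed substrings, case insensitive
regex - output matches a regular expression
many others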

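See the promptfoo documentation for the complete list of built-in assertions.

For our use case, we'll use contains-all to make sure a given output includes all of the appropriate classification labels.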
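
To make the difference from exact matching concrete, here is a rough Python sketch of the kind of check contains-all performs, based only on the description in the list above (a case-sensitive substring test for every listed value). The function contains_all is our own illustration, not promptfoo's implementation.

```python
def contains_all(output, expected_labels):
    """True if every expected label appears in the output as a substring."""
    return all(label in output for label in expected_labels)


expected = ["Software Bug", "Service Outage"]
output = "Service Outage, Software Bug"  # same labels, opposite order

# Exact match is order-sensitive, so it rejects this output.
print(output == "Software Bug, Service Outage")
# The substring check ignores order, so it accepts the same output.
print(contains_all(output, expected))
```

One consequence of a substring check like this is that it does not penalize extra labels: an output that lists additional categories alongside the expected ones would still pass.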

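One way to load and structure a promptfoo eval dataset is via CSV. As we saw before, we can provide a special CSV column named __expected to specify the grading logic. In this column we can use any of the built-in assertions listed above, including contains-all.

Create a new file named dataset.csv and paste the following into it: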

complaint,__expected
The app crashes every time I try to upload a photo,contains-all:Software Bug
My printer isn't recognized by my computer,contains-all:Hardware Malfunction
I can't figure out how to change my password,contains-all:User Error
The website is completely down I can't access any pages,contains-all:Service Outage
It would be great if the app had a dark mode option,contains-all:Feature Request
The software keeps freezing when I try to save large files,contains-all:Software Bug
My wireless mouse isn't working even with new batteries,contains-all:Hardware Malfunction
I accidentally deleted some important files can you help me recover them?,contains-all:User Error
None of your servers are responding is there an outage?,contains-all:Service Outage
Could you add a feature to export data in CSV format?,contains-all:Feature Request
"The app is crashing and my phone is overheating","contains-all:Software Bug,Hardware Malfunction"
I can't remember my password!,contains-all:User Error
The new update broke something and the app no longer works for me,contains-all:Software Bug
"I think I installed something incorrectly now my computer won't start at all","contains-all:User Error,Hardware Malfunction"
"Your service is down and I urgently need a feature to batch process files","contains-all:Service Outage,Feature Request"
The graphics card is making weird noises,contains-all:Hardware Malfunction
My keyboard just totally stopped working out of nowhere,contains-all:Hardware Malfunction
Whenever I open your app my phone gets really slow,contains-all:Software Bug
Can you make the interface more user-friendly? I always get lost in the menus,"contains-all:Feature Request,User Error"
The cloud storage isn't syncing and I can't access my files from other devices,"contains-all:Software Bug,Service Outage"
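
Rather than writing these rows by hand, a file of this shape could also be generated from the eval_data list. The sketch below uses Python's csv module; the helper names to_promptfoo_rows and to_csv_text are our own. Note one difference from the hand-written file: it leaves commas inside complaint text intact and lets the CSV writer quote those fields, instead of removing the commas.

```python
import csv
import io


def to_promptfoo_rows(eval_data):
    """Turn eval_data entries into (complaint, __expected) rows."""
    rows = [("complaint", "__expected")]
    for item in eval_data:
        # Join the golden labels into a single contains-all assertion.
        expected = "contains-all:" + ",".join(item["golden_answer"])
        rows.append((item["complaint"], expected))
    return rows


def to_csv_text(eval_data):
    """Serialize the rows as CSV; fields containing commas are quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(to_promptfoo_rows(eval_data))
    return buffer.getvalue()


# Two sample entries copied from eval_data; the full list works the same way.
sample = [
    {"complaint": "I can't remember my password!",
     "golden_answer": ["User Error"]},
    {"complaint": "The app is crashing and my phone is overheating",
     "golden_answer": ["Software Bug", "Hardware Malfunction"]},
]
print(to_csv_text(sample))
```

To write the real file, the returned text could be saved with something like open("dataset.csv", "w", newline="").write(to_csv_text(eval_data)).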

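Our CSV contains two columns:

complaint - the actual input complaint
__expected - a contains-all assertion

Take a look at one of the rows, such as this one:

"Your service is down and I urgently need a feature to batch process files","contains-all:Service Outage,Feature Request"

This row specifies that for the input "Your service is down and I urgently need a feature to batch process files", we want promptfoo to check the model's output and make sure it contains both "Service Outage" and "Feature Request".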
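
As an illustration of how a cell in the __expected column breaks down, the sketch below splits it into an assertion type and a list of values. The function parse_expected is hypothetical and only mirrors the format shown in this lesson; promptfoo's own parsing may differ in details such as whitespace handling.

```python
def parse_expected(cell):
    """Split an __expected cell into (assertion_type, list_of_values)."""
    # Everything before the first colon names the assertion;
    # the rest is a comma-separated list of values.
    assertion_type, _, value = cell.partition(":")
    values = [v.strip() for v in value.split(",") if v.strip()]
    return assertion_type, values


kind, labels = parse_expected("contains-all:Service Outage,Feature Request")
print(kind)    # contains-all
print(labels)  # ['Service Outage', 'Feature Request']
```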

The final step is to update our promptfooconfig.yaml file to include the tests we just wrote. The file should now look like this:

description: "Complaint Classification Eval"

prompts:
  - prompts.py:basic_prompt
  - prompts.py:improved_prompt
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

tests: dataset.csv
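
Before spending API calls, it can be worth a quick sanity check of dataset.csv, since a misspelled label or an unquoted comma would otherwise show up as a failing test rather than as an obvious data error. The following is a minimal sketch; find_dataset_problems and CATEGORIES are our own names, and the check assumes every row uses a contains-all assertion as in this lesson.

```python
import csv
import io

CATEGORIES = {"Software Bug", "Hardware Malfunction", "User Error",
              "Service Outage", "Feature Request"}


def find_dataset_problems(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        expected = row.get("__expected") or ""
        if not expected.startswith("contains-all:"):
            problems.append(f"line {line_no}: missing contains-all assertion")
            continue
        labels = expected[len("contains-all:"):].split(",")
        for label in labels:
            if label.strip() not in CATEGORIES:
                problems.append(
                    f"line {line_no}: unknown label {label.strip()!r}")
    return problems
```

It could be run with find_dataset_problems(open("dataset.csv").read()); an empty list means no problems were found by these particular checks.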

Running the evaluation

To run the evaluation, we'll use the same command we've seen before:

npx promptfoo@latest eval

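The first time we ran the above evaluation, the output (a screenshot in the original lesson, not reproduced here) compared the two prompts side by side.

It's clear that the improved prompt, which includes examples, performs better than our initial simple prompt. The examples help the model understand the cases where we may want the output to include multiple category assignments.

As always, we can also open an interactive web view of the evaluation results with this command: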

npx promptfoo@latest view

We can see that our basic_prompt got 80% correct, while improved_prompt scored 100%.
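
To put those percentages in terms of rows: dataset.csv has 20 data rows, so if each row counts equally, 80% corresponds to 16 passing rows and 100% to all 20. The counts 16 and 20 below are inferred from the reported percentages, not read from promptfoo's output.

```python
def pass_rate(passed, total):
    """Percentage of test rows whose assertion passed."""
    return 100 * passed / total


TOTAL_ROWS = 20  # data rows in dataset.csv

print(pass_rate(16, TOTAL_ROWS))  # 80.0
print(pass_rate(20, TOTAL_ROWS))  # 100.0
```

With only 20 rows, each individual row moves the score by 5 percentage points, so a single flipped result changes the picture noticeably.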

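As always, keep in mind that we're using small educational datasets that are not representative of real-world evaluations. We always recommend that an eval dataset have at least 100 rows.

Next, we'll look at how to write custom grading logic in promptfoo!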
