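Note: this article is adapted from the GitHub project anthropics/courses/prompt_evaluations and was translated into Chinese. A few expressions may differ from the original because of translation, but the main logic is unaffected. For personal reading and study only.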

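Code-graded evals: a classification task

In this lesson, we'll implement a slightly more complex code-graded eval from scratch to test a customer complaint classification prompt. Our goal is to write a prompt that reliably classifies customer complaints into the following categories: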

Software Bug
Hardware Malfunction
User Error
Feature Request
Service Outage

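For example, the following complaint text:

The website is completely down, I can't access any pages

should be classified as Service Outage.

In some cases, we may want to allow up to two applicable categories, as in this example:

I think I installed something incorrectly, now my computer won't start at all

should be classified as both User Error and Hardware Malfunction.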

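The eval dataset

We'll start by defining our eval dataset of inputs and golden answers. Remember that we would normally want an eval dataset of around 100 inputs, but to keep these lessons simple (and quick and affordable to run), we use a pared-down set of 20.

The test set is a list of dictionaries, where each dictionary has a complaint key and a golden_answer key: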

In [2]:

eval_data = [
    {
        "complaint": "The app crashes every time I try to upload a photo",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My printer isn't recognized by my computer",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I can't figure out how to change my password",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The website is completely down, I can't access any pages",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "It would be great if the app had a dark mode option",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The software keeps freezing when I try to save large files",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "My wireless mouse isn't working, even with new batteries",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "I accidentally deleted some important files, can you help me recover them?",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "None of your servers are responding, is there an outage?",
        "golden_answer": ["Service Outage"]
    },
    {
        "complaint": "Could you add a feature to export data in CSV format?",
        "golden_answer": ["Feature Request"]
    },
    {
        "complaint": "The app is crashing and my phone is overheating",
        "golden_answer": ["Software Bug", "Hardware Malfunction"]
    },
    {
        "complaint": "I can't remember my password!",
        "golden_answer": ["User Error"]
    },
    {
        "complaint": "The new update broke something and the app no longer works for me",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "I think I installed something incorrectly, now my computer won't start at all",
        "golden_answer": ["User Error", "Hardware Malfunction"]
    },
    {
        "complaint": "Your service is down, and I urgently need a feature to batch process files",
        "golden_answer": ["Service Outage", "Feature Request"]
    },
    {
        "complaint": "The graphics card is making weird noises",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "My keyboard just totally stopped working out of nowhere",
        "golden_answer": ["Hardware Malfunction"]
    },
    {
        "complaint": "Whenever I open your app, my phone gets really slow",
        "golden_answer": ["Software Bug"]
    },
    {
        "complaint": "Can you make the interface more user-friendly? I always get lost in the menus",
        "golden_answer": ["Feature Request", "User Error"]
    },
    {
        "complaint": "The cloud storage isn't syncing and I can't access my files from other devices",
        "golden_answer": ["Software Bug", "Service Outage"]
    }
]
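Before sending anything to a model, it can be worth checking the dataset itself. The sketch below is not part of the original lesson: validate_eval_data and ALLOWED_CATEGORIES are hypothetical names, and the limit of two labels is an assumption taken from the task description above. It uses a two-item sample of the dataset so that it runs on its own.

```python
# Hypothetical helper (not from the original lesson): sanity-check an eval dataset.
ALLOWED_CATEGORIES = {
    "Software Bug",
    "Hardware Malfunction",
    "User Error",
    "Feature Request",
    "Service Outage",
}

def validate_eval_data(eval_data, max_labels=2):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    for index, item in enumerate(eval_data):
        labels = item.get("golden_answer", [])
        if not item.get("complaint"):
            problems.append(f"item {index}: missing complaint text")
        if not 1 <= len(labels) <= max_labels:
            problems.append(
                f"item {index}: expected 1 to {max_labels} labels, got {len(labels)}"
            )
        for label in labels:
            if label not in ALLOWED_CATEGORIES:
                problems.append(f"item {index}: unknown category {label!r}")
    return problems

# Two items copied from the dataset above.
sample = [
    {"complaint": "The app crashes every time I try to upload a photo",
     "golden_answer": ["Software Bug"]},
    {"complaint": "The app is crashing and my phone is overheating",
     "golden_answer": ["Software Bug", "Hardware Malfunction"]},
]
sample_problems = validate_eval_data(sample)
print(sample_problems)
```

An empty list means no problems were found. A misspelled golden label would otherwise show up later as a model error that no prompt change could fix.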

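An initial prompt

We'll start with a basic prompt and measure how it performs. The prompt-generating function below takes a complaint as an argument and returns a prompt string: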

In [3]:

def basic_prompt(complaint):
    return f"""
    Classify the following customer complaint into one or more of these categories: 
    Software Bug, Hardware Malfunction, User Error, Feature Request, or Service Outage.
    Only respond with the matching category or categories and nothing else.

    Complaint: {complaint}

    Classification:
    """

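Collecting outputs

Next, we'll write the logic that evaluates the prompt. This logic is a bit more complex than in the "counting legs" example from our previous lesson: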

In [4]:

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def get_model_response(prompt, model_name):
    response = client.messages.create(
        model=model_name,
        max_tokens=200,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response.content[0].text

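def calculate_accuracy(eval_data, model_responses):
    correct_predictions = 0
    total_predictions = len(eval_data)
    
    for item, response in zip(eval_data, model_responses):
        # Normalize both sides to lowercase sets so order and case are ignored.
        golden_set = set(category.lower() for category in item["golden_answer"])
        # The response is split on commas only; categories separated any other
        # way (for example by newlines) stay in one string and will not match.
        prediction_set = set(category.strip().lower() for category in response.split(','))
        
        # Correct only when the predicted categories exactly equal the golden ones.
        if golden_set == prediction_set:
            correct_predictions += 1
    
    return correct_predictions / total_predictions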

def evaluate_prompt(prompt_func, eval_data, model_name):
    print(f"Evaluating with model: {model_name}")
    model_responses = [get_model_response(prompt_func(item['complaint']), model_name) for item in eval_data]
    accuracy = calculate_accuracy(eval_data, model_responses)
    
    print(f"Accuracy: {accuracy:.2%}")
    
    for item, response in zip(eval_data, model_responses):
        print(f"\nComplaint: {item['complaint']}")
        print(f"Golden Answer: {item['golden_answer']}")
        print(f"Model Response: {response}")
    return accuracy

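The evaluate_prompt function does the following:

It passes each input into our prompt-generating function and runs the resulting prompt through the model using the get_model_response function, collecting the responses.
It calculates accuracy by comparing the model's answers with the golden answers from the dataset. To do this, it calls the calculate_accuracy function.
The calculate_accuracy function lowercases the golden categories and the comma-separated categories in each model response, puts each group into a set, and counts a response as correct only when the two sets are equal. Remember that this is not an exact string match eval like our previous "counting legs" eval.
calculate_accuracy returns an accuracy score.
evaluate_prompt prints the final results.

Note that we are no longer grading by exact string match as in the previous lesson. Our grading logic compares sets, so the order and capitalization of the categories in the model output do not matter.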
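To make the set comparison concrete, here is a small sketch. grade_response is a hypothetical helper, not a function from the lesson; it mirrors the per-item check that calculate_accuracy performs, applied to one of the two-category items from the dataset.

```python
# Illustrative only: grade_response is a hypothetical name that mirrors the
# per-item comparison inside calculate_accuracy.
def grade_response(golden_answer, response):
    golden_set = {category.lower() for category in golden_answer}
    prediction_set = {category.strip().lower() for category in response.split(",")}
    return golden_set == prediction_set

golden = ["Service Outage", "Feature Request"]

print(grade_response(golden, "Feature Request, Service Outage"))  # order differs
print(grade_response(golden, "service outage, feature request"))  # case differs
print(grade_response(golden, "Service Outage"))                   # one category missing
print(grade_response(golden, "Service Outage, Feature Request, User Error"))  # one extra
```

The first two calls print True and the last two print False: a response earns credit only when it names exactly the golden categories, with no partial credit for getting one of two right.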

Let's test it out with our initial basic_prompt:

In [5]:

evaluate_prompt(basic_prompt, eval_data, model_name="claude-3-haiku-20240307")

Evaluating with model: claude-3-haiku-20240307
Accuracy: 85.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I accidentally deleted some important files, can you help me recover them?
Golden Answer: ['User Error']
Model Response: User Error

Complaint: None of your servers are responding, is there an outage?
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: Could you add a feature to export data in CSV format?
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The app is crashing and my phone is overheating
Golden Answer: ['Software Bug', 'Hardware Malfunction']
Model Response: Hardware Malfunction
Software Bug

Complaint: I can't remember my password!
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The new update broke something and the app no longer works for me
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: I think I installed something incorrectly, now my computer won't start at all
Golden Answer: ['User Error', 'Hardware Malfunction']
Model Response: User Error, Hardware Malfunction

Complaint: Your service is down, and I urgently need a feature to batch process files
Golden Answer: ['Service Outage', 'Feature Request']
Model Response: Feature Request, Service Outage

Complaint: The graphics card is making weird noises
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: My keyboard just totally stopped working out of nowhere
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: Whenever I open your app, my phone gets really slow
Golden Answer: ['Software Bug']
Model Response: Hardware Malfunction

Complaint: Can you make the interface more user-friendly? I always get lost in the menus
Golden Answer: ['Feature Request', 'User Error']
Model Response: Feature Request

Complaint: The cloud storage isn't syncing and I can't access my files from other devices
Golden Answer: ['Software Bug', 'Service Outage']
Model Response: Software Bug, Service Outage

Out[5]: 0.85
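One of the three misses above is a formatting problem and not a classification problem: for "The app is crashing and my phone is overheating", the model named both golden categories but put them on separate lines, and response.split(',') treats that as a single unrecognized category. One possible variation, which is not part of the original lesson, is to split on commas or newlines before comparing. parse_categories is a hypothetical name, and adopting it would change how responses are graded, so the resulting scores would not be comparable with the ones recorded here.

```python
import re

# Hypothetical variation (not from the original lesson): accept commas or
# newlines as separators between categories.
def parse_categories(response):
    """Split a model response on commas or newlines and normalize each part."""
    parts = re.split(r"[,\n]", response)
    return {part.strip().lower() for part in parts if part.strip()}

golden_set = {"software bug", "hardware malfunction"}
newline_response = "Hardware Malfunction\nSoftware Bug"

# Original parsing: the whole response becomes one string, so the sets differ.
comma_only = {part.strip().lower() for part in newline_response.split(",")}
print(comma_only == golden_set)

# Tolerant parsing: both categories are recovered.
print(parse_categories(newline_response) == golden_set)
```

The first comparison prints False and the second prints True. Whether to loosen the parser or tighten the prompt's output format is a design choice; the lesson takes the second route.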

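An improved prompt

Our initial prompt scored 85% accuracy. Let's make some changes to the prompt and rerun the eval, hopefully with a better score.

The following prompt adds expanded explanations of the categories, along with 9 example input/output pairs: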

In [6]:

def improved_prompt(complaint):
    return f"""
    You are an AI assistant specializing in customer support issue classification. Your task is to analyze customer complaints and categorize them into one or more of the following categories:

    1. Software Bug: Issues related to software not functioning as intended.
    2. Hardware Malfunction: Problems with physical devices or components.
    3. User Error: Difficulties arising from user misunderstanding or misuse.
    4. Feature Request: Suggestions for new functionalities or improvements.
    5. Service Outage: System-wide issues affecting service availability.

    Important Guidelines:
    - A complaint may fall into multiple categories. If so, list all that apply but try to prioritize picking a single category when possible.

    Examples:
    1. Complaint: "The app crashes when I try to save my progress."
    Classification: Software Bug

    2. Complaint: "My keyboard isn't working after I spilled coffee on it."
    Classification: Hardware Malfunction

    3. Complaint: "I can't find the login button on your website."
    Classification: User Error

    4. Complaint: "It would be great if your app had a dark mode."
    Classification: Feature Request

    5. Complaint: "None of your services are loading for me or my colleagues."
    Classification: Service Outage

    6. Complaint "Complaint: The app breaks every time I try to change my profile picture"
    Classification: Software Bug

    7. Complaint "The app is acting buggy on my phone and it seems like your website is down, so I'm completely stuck!"
    Classification: Software Bug, Service Outage

    8. Complaint: "Your software makes my computer super laggy and awful, I hate it!"
    Classification: Software Bug

    9. Complaint: "Your dumb app always breaks when I try to do anything with images."
    Classification: 'Software Bug'

    Now, please classify the following customer complaint:

    <complaint>{complaint}</complaint>

    Only respond with the appropriate categories and nothing else.
    Classification:
    """

Let's run the eval with the improved prompt:

In [80]:

evaluate_prompt(improved_prompt, eval_data, model_name="claude-3-haiku-20240307")

Evaluating with model: claude-3-haiku-20240307
Accuracy: 100.00%

Complaint: The app crashes every time I try to upload a photo
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My printer isn't recognized by my computer
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I can't figure out how to change my password
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The website is completely down, I can't access any pages
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: It would be great if the app had a dark mode option
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The software keeps freezing when I try to save large files
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: My wireless mouse isn't working, even with new batteries
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: I accidentally deleted some important files, can you help me recover them?
Golden Answer: ['User Error']
Model Response: User Error

Complaint: None of your servers are responding, is there an outage?
Golden Answer: ['Service Outage']
Model Response: Service Outage

Complaint: Could you add a feature to export data in CSV format?
Golden Answer: ['Feature Request']
Model Response: Feature Request

Complaint: The app is crashing and my phone is overheating
Golden Answer: ['Software Bug', 'Hardware Malfunction']
Model Response: Software Bug, Hardware Malfunction

Complaint: I can't remember my password!
Golden Answer: ['User Error']
Model Response: User Error

Complaint: The new update broke something and the app no longer works for me
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: I think I installed something incorrectly, now my computer won't start at all
Golden Answer: ['User Error', 'Hardware Malfunction']
Model Response: Hardware Malfunction, User Error

Complaint: Your service is down, and I urgently need a feature to batch process files
Golden Answer: ['Service Outage', 'Feature Request']
Model Response: Service Outage, Feature Request

Complaint: The graphics card is making weird noises
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: My keyboard just totally stopped working out of nowhere
Golden Answer: ['Hardware Malfunction']
Model Response: Hardware Malfunction

Complaint: Whenever I open your app, my phone gets really slow
Golden Answer: ['Software Bug']
Model Response: Software Bug

Complaint: Can you make the interface more user-friendly? I always get lost in the menus
Golden Answer: ['Feature Request', 'User Error']
Model Response: User Error, Feature Request

Complaint: The cloud storage isn't syncing and I can't access my files from other devices
Golden Answer: ['Software Bug', 'Service Outage']
Model Response: Software Bug, Service Outage

Out[80]: 1.0

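We got 100% accuracy with the updated, improved prompt!

Once again, we followed the standard prompt + eval loop outlined in this diagram.

Keep in mind that this is a very simple eval that uses a very small dataset. This lesson is meant to illustrate the general process of code-graded evals, but it is not a model of a production-scale eval!

This approach works, but writing the eval logic from scratch is somewhat tedious, and it is hard to compare results side by side.
Wouldn't it be nice if we could use a tool that generates nicely formatted results with charts and graphs, and makes it easy to run evals across multiple models? In the next lesson, we'll see just that!
Next, we'll look at an eval framework that makes it easy to write repeatable, scalable evals for production use cases.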
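In the meantime, a small hand-written helper can give a rough side-by-side view. This is only a sketch under stated assumptions: compare_runs and is_correct are hypothetical names, and it works on responses that were already collected and does not call the model. The two items and their responses are copied from the runs shown earlier in this lesson.

```python
# Hypothetical sketch: compare recorded responses from two prompts side by side.
def is_correct(golden_answer, response):
    """Same set-equality check that calculate_accuracy applies per item."""
    golden_set = {category.lower() for category in golden_answer}
    prediction_set = {category.strip().lower() for category in response.split(",")}
    return golden_set == prediction_set

def compare_runs(eval_items, runs):
    """runs maps a prompt name to its list of responses, in dataset order."""
    rows = []
    for index, item in enumerate(eval_items):
        row = {"complaint": item["complaint"]}
        for name, responses in runs.items():
            row[name] = is_correct(item["golden_answer"], responses[index])
        rows.append(row)
    return rows

items = [
    {"complaint": "Whenever I open your app, my phone gets really slow",
     "golden_answer": ["Software Bug"]},
    {"complaint": "Can you make the interface more user-friendly? I always get lost in the menus",
     "golden_answer": ["Feature Request", "User Error"]},
]
# Responses copied from the two runs recorded above.
runs = {
    "basic_prompt": ["Hardware Malfunction", "Feature Request"],
    "improved_prompt": ["Software Bug", "User Error, Feature Request"],
}

rows = compare_runs(items, runs)
for row in rows:
    print(row)
```

Each printed row shows, for one complaint, whether each prompt was graded correct, which makes it easier to see which items a prompt change fixed.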
