特别注意：本文搬运自 github 项目 anthropics/courses/prompt_evaluations ，经过中文翻译，内容可能因为转译有少数表达上的改动，但不影响主要逻辑，仅供个人阅读学习使用

介绍 promptfoo

注意：本课程位于包含相关代码文件的文件夹中。如果您想跟上进度并自行运行评估，请下载整个文件夹

我们已经看到了如何从头开始编写自己的评估，这可以很有效，但有点繁琐。通常更实用的是利用专门为此目的设计的工具。
如今有许多评估工具和库可用（而且还在不断发布新的！），包括：

一个开源且易于使用的选项是 promptfoo。Promptfoo 提供了一个开箱即用的解决方案，可以显著减少全面测试提示所需的时间和精力。
它提供了一个简单、现成的批量测试、版本控制和性能分析的基础设施，使开发者能够更专注于优化他们的提示，而不是构建和维护测试框架。
它使得跨多个提示、模型和提供商运行评估变得容易，并且还提供了工具来轻松可视化和比较评估结果。
Promptfoo 和其他评估工具比从零开始编写自己的评估逻辑有了巨大的改进！

在我们运行评估后，promptfoo 将生成一个类似于图片中的仪表板：

让我们开始吧！

我们的第一个 promptfoo eval

本课程接下来的几节课将专注于使用 promptfoo 来编写评估。在第一节课中，我们将学习一种简单的方法，使用 promptfoo 来评估我们之前几节课中的“这个动物有多少条腿？”提示。
这是一个非常简单的提示和评估。这里的重点是使用 promptfoo 运行评估的实际工具和流程。

作为提醒，在那节课中我们使用了这个小型评估数据集：

eval_data = [
    {"animal_statement": "The animal is a human.", "golden_answer": "2"},
    {"animal_statement": "The animal is a snake.", "golden_answer": "0"},
    {"animal_statement": "The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.", "golden_answer": "5"},
    {"animal_statement": "The animal is a dog.", "golden_answer": "4"},
    {"animal_statement": "The animal is a cat with two extra legs.", "golden_answer": "6"},
    {"animal_statement": "The animal is an elephant.", "golden_answer": "4"},
    {"animal_statement": "The animal is a bird.", "golden_answer": "2"},
    {"animal_statement": "The animal is a fish.", "golden_answer": "0"},
    {"animal_statement": "The animal is a spider with two extra legs", "golden_answer": "10"},
    {"animal_statement": "The animal is an octopus.", "golden_answer": "8"},
    {"animal_statement": "The animal is an octopus that lost two legs and then regrew three legs.", "golden_answer": "9"},
    {"animal_statement": "The animal is a two-headed, eight-legged mythical creature.", "golden_answer": "8"},
]

在那节课中，我们写了三个不同的提示，它们从我们的初级评估函数中获得了逐渐提高的准确率分数。在这节课中，我们将评估数据集和提示移植到 promptfoo 中，看看运行和比较它们的输出有多容易。

安装 promptfoo

使用 promptfoo 的第一步是通过命令行安装它。导航到你要编写评估代码的文件夹，并运行以下命令：

npx promptfoo@latest init

这将在你的当前目录中创建一个 promptfooconfig.yaml 文件。这个文件就是所有神奇操作发生的地方。在它里面，我们配置以下内容：

我们在评估中要使用的提供者（Anthropic API 模型）
我们想要评估的提示
我们想要运行的测试

配置提供者¶

接下来，我们可以配置 promptfoo 以使用我们想要运行评估的特定 Anthropic API 模型。为此，我们在 promptfooconfig.yaml 文件中指定一个 providers 字段，并将其设置为一个或多个 Anthropic 模型。Promptfoo 使用特定的模式来指定模型名称。当前支持的 Anthropic 模型字符串有：

anthropic:messages:claude-3-5-sonnet-20240620
anthropic:messages:claude-3-haiku-20240307
anthropic:messages:claude-3-sonnet-20240229
anthropic:messages:claude-3-opus-20240229
anthropic:messages:claude-2.0
anthropic:messages:claude-2.1
anthropic:messages:claude-instant-1.2

我们将使用 Haiku 进行这次第一个评估。删除 promptfooconfig.yaml 文件中的现有内容，并用以下内容替换它：

description: "Animal Legs Eval"
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

下面是每个部分的作用说明：

description 是一个可选的标签，用于描述我们正在评估的任务。
providers 告诉 promptfoo 我们希望使用 Haiku 进行此评估。我们可以指定多个模型，正如我们将在未来的课程中看到的那样。

Promptfoo 在运行评估时会查找名为 ANTHROPIC_API_KEY 的环境变量。您可以通过在命令行中运行此命令来设置环境变量：

export ANTHROPIC_API_KEY=your_api_key_here

指定我们的提示符¶

下一步是告诉 promptfoo 我们想要评估的提示符。有多种方法可以做到这一点，包括：

将提示符直接作为文本放在 YAML 文件中
从 JSON 文件加载提示
从文本文件加载提示
从另一个 YAML 文件加载提示
从 Python 文件加载提示

我们倾向于将所有相关的提示放在一个 Python 文件中，作为单独的函数来返回提示字符串。在后面的课程中，我们将看到其他替代方法。Promptfoo 非常灵活，正如你在这门课程中将要看到的那样！

创建一个名为 prompts.py 的 Python 文件，并将以下提示函数放在它里面：

def simple_prompt(animal_statement):
    return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
    
    Here is the animal statement.
    <animal_statement>{animal_statement}</animal_statement>
    
    How many legs does the animal have? Please respond with a number"""

def better_prompt(animal_statement):
    return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
    
    Here is the animal statement.
    <animal_statement>{animal_statement}</animal_statement>
    
    How many legs does the animal have? Please only respond with a single digit like 2 or 9"""

请注意，这些函数中的每一个都接受一个 animal_statement 参数，将其插入到提示中，然后返回最终的提示字符串

下一步是告诉 promptfoo 配置文件我们想要从刚才创建的 prompts.py 文件中加载提示。为此，请更新 promptfooconfig.yaml 文件以包含以下代码：

description: "Animal Legs Eval"

prompts:
  - prompts.py:simple_prompt
  - prompts.py:better_prompt
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

注意，我们为 prompts.py 文件中添加的每个提示函数都添加了单独的一行。我们已经告诉 promptfoo 我们希望评估两个提示：simple_prompt 和 better_prompt，这两个提示都“存在于” prompts.py 文件中。

配置我们的测试¶

下一步是告诉 promptfoo 我们想要用特定的提示和提供者运行哪些特定的测试。Promptfoo 为我们定义测试提供了许多选项，但我们将从最常见的方法之一开始：在 CSV 文件中指定我们的测试。

我们将创建一个名为 dataset.csv 的新 CSV 文件，并将我们的测试输入写入其中。

Promptfoo 允许我们将评估逻辑直接定义在 CSV 文件中。
在接下来的课程中，我们将看到 promptfoo 提供的一些内置测试断言，但对于这个特定的评估，我们只需要在模型的输出和预期输出（腿的数量）之间查找一个确切的字符串匹配。

为此，我们将用两个列标题来编写我们的 CSV：

动物陈述 - 包含输入的动物陈述，例如“动物是一头大象”
__预期__ - 包含预期的正确输出（注意双下划线在__预期__中）。这是 promptfoo 特定的语法。

创建一个 dataset.csv 文件，并向其中添加以下内容：

animal_statement,__expected
"The animal is a human.","2"
"The animal is a snake.","0"
"The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.","5"
"The animal is a dog.","4"
"The animal is a cat with two extra legs.","6"
"The animal is an elephant.","4"
"The animal is a bird.","2"
"The animal is a fish.","0"
"The animal is a spider with two extra legs","10"
"The animal is an octopus.","8"
"The animal is an octopus that lost two legs and then regrew three legs.","9"
"The animal is a two-headed, eight-legged mythical creature.","8"

最后，我们将告诉 promptfoo 使用我们的 dataset.csv 文件来加载测试。为此，请更新 promptfooconfig.yaml 文件以包含此代码：

description: "Animal Legs Eval"

prompts:
  - prompts.py:simple_prompt
  - prompts.py:better_prompt
  
providers:
  - "anthropic:messages:claude-3-haiku-20240307"

tests: animal_legs_tests.csv

运行我们的评估

现在我们已经指定了提供者、提示和测试，是时候运行评估了！

在终端中运行以下命令：

npx promptfoo@latest eval

这将启动评估过程。对于我们的每个提示，promptfoo 将会：

从 CSV 文件中获取每个 animal_statement
构建包含 animal_statement 的完整提示
使用单个提示向 Anthropic API 发送请求
检查输出是否与 CSV 文件中的预期输出匹配

评估完成后，promptfoo 将在终端中显示结果

这是从运行上述两段代码中得到的示例 promptfoo 输出：

上述截图仅包括前四行，但评估是在所有十二个输入上运行的。

左列显示了特定的 animal_statement
中间列显示了 simple_prompt 的输出和分数，它在每个测试用例中都似乎失败了！
右侧列出了 better_prompt 的输出和得分，它在大多数测试用例中都能成功，但在逻辑上比较复杂的情况下会失败。

查看评估结果¶

Promptfoo 可以轻松启动一个仪表板，以便在浏览器中可视化和检查评估结果。运行上述评估后，请在终端中运行此命令：

npx promptfoo@latest view

这将询问您是否要启动服务器（输入 ‘y’），然后在浏览器中打开仪表板。

最相关的摘要信息在顶部：

我们也可以针对特定结果来分析它们失败的原因。让我们来看看其中一个 simple_prompt 结果（中间列）。对于这个提示，每一行都被标记为失败。这是怎么回事？

点击单元格中的放大镜按钮以了解更多：

这会打开一个包含有关输出和评分的详细信息的模态框：

我们可以清楚地看到，这个 简单提示 得到了正确的答案 0，但输出包括一些额外的、不必要的解释性文本，导致它无法通过评估。

如果我们仔细查看最右边的列，其中包含我们 更好提示 提示的结果，我们会得到更好的响应，这些响应都是单个数字，如 5 或 0。它似乎在更复杂的 动物陈述 上失败了，这些陈述需要更多的推理才能回答，例如：

狐狸失去了一条腿，然后神奇地长回了失去的腿，并在上面长出了一条神秘的额外腿。

添加第三个提示¶

回顾早期关于代码评分评估的课程，我们最终通过在提示中添加一些思维链推理得到了最佳结果。让我们添加一个包含思维链的改进的第三个提示，看看它在“更棘手”的问题上表现如何！

添加以下提示函数到 prompts.py：

def chain_of_thought_prompt(animal_statement):
    return f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
    
    Here is the animal statement.
    <animal_statement>{animal_statement}</animal_statement>
    
    How many legs does the animal have? 
    Start by reasoning about the numbers of legs the animal has, thinking step by step inside of <thinking> tags.  
    Then, output your final answer inside of <answer> tags. 
    Inside the <answer> tags return just the number of legs as an integer and nothing else."""

我们将看到类似这样的输出，现在包括4列：

我们可以使用浏览器再次查看结果：

npx promptfoo@latest view

我们将看到一个看起来像这样的网页：

我们可以清楚地看到，包含思维链的提示词能够100%正确回答问题！

比较模型

promptfoo 的一个很棒的特点是，它很容易用不同的模型运行评估。
为了让提示在 Haiku 上得分达到 100%，我们不得不做一些提示工程的工作，但让我们看看如果我们决定切换到一个更强大的模型，比如 Claude 3.5 Sonnet，会发生什么。

我们只需要更新我们的 promptfooconfig.yaml 文件，以包含一个与有效的 Anthropic 提供者字符串匹配的第二提供者。更新 promptfooconfig.yaml 文件，以包含两个提供者：

description: "Animal Legs Eval"

prompts:
  - prompts.py:simple_prompt
  - prompts.py:better_prompt
  - prompts.py:chain_of_thought_prompt
  
providers:
  - anthropic:messages:claude-3-haiku-20240307
  - anthropic:messages:claude-3-5-sonnet-20240620

tests: animal_legs_tests.csv

defaultTest:
  options:
    transform: file://transform.py

然后我们可以用之前相同的命令再次运行我们的评估：

npx promptfoo@latest eval

当我们在网页版仪表板上查看时，我们看到一些有趣的结果！

只需在 YAML 文件中添加一行，我们就能在两个模型上运行我们的评估集。前三个输出列是 Claude 3 Haiku 的输出，最后三个是 Claude 3.5 Sonnet 的输出。
看起来，Claude 3.5 Sonnet 在我们的评估中达到了 100%，即使使用 简单提示 ，而 Claude 3 Haiku 在这个提示下得分仅为 0%。

这种信息非常宝贵：不仅要知道哪个提示表现最好，还要知道对于特定任务，哪个模型+提示组合表现最好。

侧面提示： 如果你想知道为什么 Claude 3.5 Sonnet 在思维链提示中没有得到 100%，这里有一个解释！它在测试中错误地处理了 动物陈述 “动物是章鱼。” 在它的 <thinking> 标签内，Claude 3.5 Sonnet 推理认为章鱼实际上没有腿，而是有通常被称为”手臂”但从不被称为”腿”的附肢。
通过升级到”更聪明”的模型，我们在思维链提示中实际上看到了稍差的性能，因为模型”太聪明了”。如果我们想要确保所有模型上的性能，我们可以更新提示，使其更具体地说明什么实际上可以被视为”腿”。

这节课只是我们对 promptfoo 的初步体验。在未来的课程中，我们将学习如何处理更复杂的代码评分逻辑，定义我们自己的自定义评分器，并运行模型评分评估。

anthropics的prompt评测教程5：promptfoo的动物分类代码