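Note: this article is adapted from the GitHub project anthropics/courses/prompt_evaluations by way of a Chinese translation. A few expressions may differ from the original as a result, but the main logic is unaffected. It is intended for personal reading and study only.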

Model-graded evaluations with promptfoo

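Note: this lesson lives in a folder that contains the relevant code files. If you want to follow along and run the evaluation yourself, download the entire folder.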

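So far, we have only written code-graded evaluations. Where they are possible, code-graded evaluations are the simplest and least expensive kind to write and run.
They provide clear, objective assessments against predefined criteria, which makes them a good fit for tasks with simple, quantifiable outcomes.
The catch is that code-graded evaluations can only assess certain kinds of output, mainly output that can be reduced to exact matches, numerical comparisons, or other programmable logic.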

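However, many real-world language model applications call for more nuanced evaluation.
Suppose we want to build a chatbot for a middle school classroom. We may need to evaluate its outputs to make sure they use age-appropriate language, stay educational, avoid answering non-academic questions, and explain things at a level of complexity a middle schooler can follow.
These criteria are subjective and context-dependent, which makes them hard to assess with traditional code-based approaches. This is where model-graded evaluations come in!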

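Model-graded evaluations use the capabilities of large language models to assess outputs against more complex, nuanced criteria.
By using another model as the evaluator, we can draw on the same level of language understanding and contextual awareness that produced the original response.
This lets us create more sophisticated evaluation metrics that account for factors such as tone, relevance, appropriateness, and even creativity, all of which are typically beyond the reach of code-graded systems.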

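The core idea behind model-graded evaluations is to treat the evaluation itself as a natural language processing task. We give the evaluating model some combination of the following:

The original prompt or question
The model-generated response we want to evaluate
A set of evaluation criteria or guidelines
Instructions for how to assess and score the response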
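
As a rough illustration of how those four ingredients might be combined, the sketch below assembles a grader prompt in Python. The function name build_grader_prompt, the XML-style tag names, and the requested PASS/FAIL output format are all assumptions made for this example. This is not the grading prompt that promptfoo uses internally.

```python
def build_grader_prompt(question: str, response: str, rubric: str) -> str:
    """Assemble a grading prompt from the ingredients listed above.

    Hypothetical helper for illustration only; the tag names and the
    PASS/FAIL verdict format are assumptions, not a promptfoo API.
    """
    return (
        "You are grading the output of another model.\n\n"
        f"<question>\n{question}\n</question>\n\n"
        f"<response>\n{response}\n</response>\n\n"
        f"<rubric>\n{rubric}\n</rubric>\n\n"
        "First explain your reasoning in one or two sentences, "
        "then end with a final line that is exactly PASS or FAIL."
    )


prompt = build_grader_prompt(
    question="What's the best free mobile video game?",
    response="Let's keep our focus on school subjects. What are you studying in science?",
    rubric="Refuses to answer the question and instead redirects to academic topics",
)
print(prompt)
```

Asking for the reasoning before the verdict is one possible design choice: it gives a human reviewer something to inspect when a grade looks wrong.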

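This approach allows a more holistic assessment of outputs, one that considers not just factual accuracy but also stylistic elements, adherence to specific guidelines, and the overall quality of the response for its intended purpose.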

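Common model-graded evaluation techniques include asking the model:

How apologetic is this response?
Is this response factually accurate given the provided context?
Does this response mention its context/information too much?
Does this response actually answer the question appropriately?
Does this output follow our tone/brand/style guidelines?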

In this lesson, we will write our own simple model-graded evaluation using promptfoo.

Model-graded evals with promptfoo

As with most things in promptfoo, there are multiple valid approaches to writing model-graded evaluations. In this lesson we’ll see the simplest pattern: utilizing built-in assertions. In the next lesson, we’ll see how to write our own custom model-graded assertion functions.

To start, we’ll use a built-in assertion called llm-rubric, which is promptfoo’s general-purpose grader for “LLM as a judge” evaluations. Using it is as simple as adding the following to your promptfooconfig.yaml file:

assert:
  - type: llm-rubric
    # The model we want to use as the grader
    provider: anthropic:messages:claude-3-opus-20240229
    # Specify the criteria for grading the LLM output:
    value: Is not apologetic

The above assertion will use Claude 3 Opus to grade the output based on whether or not the response is apologetic.

Let’s try using llm-rubric in our own evaluation!

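Writing our own evaluation

In this lesson, we will focus on evaluating prompts for an academic assistant aimed at middle school students. We are building a chatbot that should answer questions related to school subjects but should avoid answering unrelated questions.
We will start with a simple prompt like this: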

You are an interactive tutor assistant for middle school children.
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
Only answer questions related to middle school academics.
This is the student question: {{question}}

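We will write a model-graded evaluation to make sure the model refuses to answer questions like "How can I get my parents to buy me a phone?" or "What are some fun free video games?" In theory, we could try to write a code-graded evaluation that uses regular expressions to look for key refusal phrases such as "I'm sorry" or "I can't help", but that is a very brittle approach: it misses refusals that are worded differently, and it can flag helpful answers that happen to contain an apology.
Using a model to grade the responses should give better results.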
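
To see why the regular-expression approach is brittle, consider the minimal Python sketch below. The phrase list, the function name looks_like_refusal, and the three sample outputs are all invented for this illustration; none of them come from the course code.

```python
import re

# Hypothetical list of refusal phrases; a real list would need to be far longer.
REFUSAL_PATTERN = re.compile(
    r"\b(i'm sorry|i am sorry|i apologize|i can't help|i cannot help)\b",
    re.IGNORECASE,
)


def looks_like_refusal(output: str) -> bool:
    """Return True if the output contains one of the hard-coded refusal phrases."""
    return REFUSAL_PATTERN.search(output) is not None


explicit_refusal = "I'm sorry, but I can't help with that."
polite_redirect = "That's outside what we cover here. Want to try a math problem instead?"
helpful_apology = "I'm sorry you're stuck! Photosynthesis is how plants make food from sunlight."

print(looks_like_refusal(explicit_refusal))  # True
print(looks_like_refusal(polite_redirect))   # False: a refusal the pattern misses
print(looks_like_refusal(helpful_apology))   # True: a helpful answer flagged as a refusal
```

The second and third cases are both misclassified, and extending the phrase list only moves the problem around. A grader model can judge whether the response actually declines the question, regardless of wording.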

Initializing promptfoo

As always, the first step is to initialize promptfoo. In a new folder, run the following command:

npx promptfoo@latest init

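Basic configuration

To keep things simple, we will not put our prompts in a separate file as we did before. Instead, we will write the prompt and the tests directly in the promptfooconfig.yaml file. Delete any existing contents of the file and replace them with the following: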

description: "School Assistant Eval"

prompts:
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    This is the student question: {{question}}

providers: [anthropic:messages:claude-3-5-sonnet-20240620]
  # - anthropic:messages:claude-3-haiku-20240307

defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Refuses to answer the question and instead redirects to academic topics'

tests:
  - vars:
      question: What's the best free mobile video game?
  - vars:
      question: How can I convince my parents to let me stay up later?
  - vars:
      question: What are some fun theme ideas for a sleepover?
  - vars:
      question: What are some good prank ideas that won't get me in trouble?
  - vars:
      question: Who are some famous left footed soccer players?

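The configuration above includes the following:

prompts - contains the initial, basic prompt we want to evaluate. Recall that we usually put our prompts in a Python file, but we also have the option of inlining them directly in the configuration file.
providers - tells promptfoo that we want to run our prompt through Claude 3.5 Sonnet.
defaultTest - for every test promptfoo runs, we want to grade the output with llm-rubric to make sure it refuses to answer the question and instead steers the conversation back to academic topics. Here, we tell promptfoo to use Claude 3 Opus to do the grading.
tests - the set of tests we want to run. We usually specify these in a CSV file, but we can also write them inline in the YAML file. All of these tests contain questions we want the model to refuse to answer.

The next step is to run our evaluation: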

npx promptfoo@latest eval

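This is the output generated the first time we ran this evaluation:

Our prompt does well on most of the evaluation dataset (admittedly a very small one), but the model appears happy to answer the question about soccer players.
Below is a screenshot from the promptfoo web view showing the model's response along with the grader model's reasoning:

Let's try adding a second, more detailed prompt and see whether it keeps the model on academic topics.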

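A second prompt

Let's add a second prompt that includes some guidance on exactly which topics the model should discuss:

You are an interactive tutor assistant for middle school children.
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
Only answer questions related to middle school academics.
Acceptable topics you should respond to include: math, reading, science, foreign languages, social studies, and art.
Refuse to answer questions not related to the above topics in an academic context.
This is the student question: {{question}}

Update the promptfooconfig.yaml file so that it looks like this: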

description: "School Assistant Eval"

prompts:
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    This is the student question: {{question}}
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    Acceptable topics you should respond to include: math, reading, science, foreign languages, social studies, and art.
    Refuse to answer questions not related to the above topics in an academic context.
    This is the student question: {{question}}


providers: [anthropic:messages:claude-3-5-sonnet-20240620]

defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Refuses to answer the question and instead redirects to academic topics'

tests:
  - vars:
      question: What's the best free mobile video game?
  - vars:
      question: How can I convince my parents to let me stay up later?
  - vars:
      question: What are some fun theme ideas for a sleepover?
  - vars:
      question: What are some good prank ideas that won't get me in trouble?
  - vars:
      question: Who are some famous left footed soccer players?

We now have two prompts under evaluation! Let's run the evaluation again:

npx promptfoo@latest eval

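This is the result we got:

It looks like the changes in our second prompt made a difference! Keep in mind that this evaluation dataset is far too small to be useful in the real world.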

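Grading for apologies

Looking closely at the model outputs, we find that most of them begin with an apology such as "I'm sorry" or "I apologize". That is not an ideal experience for our users, so we decide to try to improve it! We want to evaluate a third prompt:

You are an interactive tutor assistant for middle school children.
Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
Only answer questions related to middle school academics.
Acceptable topics you should respond to include: math, reading, science, foreign languages, social studies, and art.
Refuse to answer questions not related to the above topics in an academic context.
Do not apologize or use an apologetic tone when refusing. Instead, gently nudge students to focus on school-related topics.
This is the student question: {{question}}

The prompt above explicitly instructs the model to avoid apologizing and instead focus on gently nudging students toward academic topics.

Next, let's add a second llm-rubric assertion that tests whether the model's output is apologetic. Update promptfooconfig.yaml so that it looks like this: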

description: "School Assistant Eval"

prompts:
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    This is the student question: {{question}}
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    Acceptable topics you should respond to include: math, reading, science, foreign languages, social studies, and art.
    Refuse to answer questions not related to the above topics in an academic context.
    This is the student question: {{question}}
  - >- 
    You are an interactive tutor assistant for middle school children.
    Students will ask you a question and your job is to respond with explanations that are understandable to a middle school audience.
    Only answer questions related to middle school academics.
    Acceptable topics you should respond to include: math, reading, science, foreign languages, social studies, and art.
    Refuse to answer questions not related to the above topics in an academic context.
    Do not apologize or use an apologetic tone when refusing.  Instead, gently nudge students to focus on school-related topics.
    This is the student question: {{question}}


providers: [anthropic:messages:claude-3-5-sonnet-20240620]

defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Refuses to answer the question and instead redirects to academic topics'
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Is not apologetic'

tests:
  - vars:
      question: What's the best free mobile video game?
  - vars:
      question: How can I convince my parents to let me stay up later?
  - vars:
      question: What are some fun theme ideas for a sleepover?
  - vars:
      question: What are some good prank ideas that won't get me in trouble?
  - vars:
      question: Who are some famous left footed soccer players?

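We now have three prompts under test. For each test case, we use a model to grade two different aspects:

The model should refuse to answer the question
The model should not apologize

Let's run the evaluation: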

npx promptfoo@latest eval

Let's launch the web view with the following command:

npx promptfoo@latest view

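Recall that we can click the magnifying glass icon to see more detail about each model output and the corresponding assertion grades. Let's take a closer look at the second entry in the first row:

We can see that the output passed the original model-graded assertion: it did refuse to answer the off-topic question. We can also see that it failed the second assertion we added, because "The response begins with 'I'm sorry', which is an apologetic phrase."

Now let's look at the third entry in the first row:

This output passed all of the assertions!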
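
To make the pass/fail logic concrete, here is a small Python sketch of how per-assertion verdicts might be combined into a result for each prompt and question. It assumes that a test case passes only when every one of its assertions passes. The Verdict type, the prompt labels, and the sample verdicts are all made up for illustration; they are not promptfoo output or promptfoo data structures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Verdict:
    """One grader decision for one (prompt, question, rubric) combination."""
    prompt: str
    question: str
    rubric: str
    passed: bool


# Illustrative verdicts for a single question; not real evaluation results.
verdicts = [
    Verdict("prompt 2", "video game", "refuses and redirects", True),
    Verdict("prompt 2", "video game", "is not apologetic", False),
    Verdict("prompt 3", "video game", "refuses and redirects", True),
    Verdict("prompt 3", "video game", "is not apologetic", True),
]


def case_results(items: list[Verdict]) -> dict[tuple[str, str], bool]:
    """Map each (prompt, question) pair to True only if all its rubrics passed."""
    results: dict[tuple[str, str], bool] = defaultdict(lambda: True)
    for v in items:
        key = (v.prompt, v.question)
        results[key] = results[key] and v.passed
    return dict(results)


results = case_results(verdicts)
for (prompt, question), ok in results.items():
    print(f"{prompt} / {question}: {'PASS' if ok else 'FAIL'}")
```

Keeping the two rubrics as separate assertions, rather than one combined rubric, makes it possible to see which aspect failed.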

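Remember that this dataset is far too small for a realistic evaluation.

Promptfoo's built-in model-graded assertions are very useful, but sometimes we need more precise control over the model-graded metrics and the grading process. In the next lesson, we will learn how to define our own custom model-graded functions!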

Anthropic's prompt evaluation tutorial, part 8: model-graded evaluations with promptfoo