特别注意:本文搬运自 github 项目 anthropics/courses/prompt_evaluations ,经过中文翻译,内容可能因为转译有少数表达上的改动,但不影响主要逻辑,仅供个人阅读学习使用
简单的代码评分评估
在这节课中,我们将从一个非常简单的代码评分评估示例开始,下一节课将涵盖一个更现实的提示。我们将遵循这个图中概述的流程:
大致步骤是:
- 首先定义我们的评估测试集
- 写下我们的初始提示尝试
- 运行它通过我们的评估流程并获得分数
- 根据评估结果修改我们的提示
- 运行修改后的提示通过我们的评估流程,并希望获得更好的分数!
让我们尝试遵循这个流程!
我们的输入数据
我们将评估一个任务,其中要求 Claude 成功识别动物有多少条腿。在未来的课程中,我们将看到更复杂和现实的提示和评估,但在这里我们故意保持简单,以专注于实际的评估过程。
第一步是编写我们的评估数据集,该数据集包括我们的输入以及相应的黄金答案。让我们使用这个简单的字典列表,其中每个字典都有一个动物陈述
和黄金答案
键:
In [1]:
eval_data = [
{"animal_statement": "The animal is a human.", "golden_answer": "2"},
{"animal_statement": "The animal is a snake.", "golden_answer": "0"},
{"animal_statement": "The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.", "golden_answer": "5"},
{"animal_statement": "The animal is a dog.", "golden_answer": "4"},
{"animal_statement": "The animal is a cat with two extra legs.", "golden_answer": "6"},
{"animal_statement": "The animal is an elephant.", "golden_answer": "4"},
{"animal_statement": "The animal is a bird.", "golden_answer": "2"},
{"animal_statement": "The animal is a fish.", "golden_answer": "0"},
{"animal_statement": "The animal is a spider with two extra legs", "golden_answer": "10"},
{"animal_statement": "The animal is an octopus.", "golden_answer": "8"},
{"animal_statement": "The animal is an octopus that lost two legs and then regrew three legs.", "golden_answer": "9"},
{"animal_statement": "The animal is a two-headed, eight-legged mythical creature.", "golden_answer": "8"},
]
注意到有些评估问题有点棘手,比如这个:
狐狸失去了一条腿,然后神奇地长回了失去的腿,并在上面长出了一条神秘的额外腿。
这将在后面很重要!
我们的初始提示
接下来,我们将定义我们的初始提示。下面的函数接收一个动物陈述,并返回一个包含我们第一次提示尝试的格式正确的消息列表:
In [2]:
def build_input_prompt(animal_statement):
user_content = f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
Here is the animal statement.
<animal_statement>{animal_statement}</animal_statement>
How many legs does the animal have? Please respond with a number"""
messages = [{'role': 'user', 'content': user_content}]
return messages
让我们快速用我们 eval
数据集中的第一个元素来测试一下
In [3]:
build_input_prompt(eval_data[0]['animal_statement'])
Out[3]:
[{'role': 'user',
'content': 'You will be provided a statement about an animal and your job is to determine how many legs that animal has.\n \n Here is the animal statement.\n <animal_statement>The animal is a human.</animal_statement>\n \n How many legs does the animal have? Please respond with a number'}]
接下来,我们将编写一个简单的函数,该函数接受一个消息列表并将其发送到 Anthropic API:
In [5]:
from anthropic import Anthropic
from dotenv import load_dotenv
load_dotenv()
client = Anthropic()
MODEL_NAME = "claude-3-haiku-20240307"
def get_completion(messages):
response = client.messages.create(
model=MODEL_NAME,
max_tokens=200,
messages=messages
)
return response.content[0].text
让我们用 `eval_data` 列表中的第一个条目来测试它,该列表包含以下动物声明:
‘The animal is a human.’
In [6]:
full_prompt = build_input_prompt(eval_data[0]['animal_statement'])
get_completion(full_prompt)
Out [6]:
'2'
我们得到 2
作为响应,这通过了肉眼测试!人类通常有两条腿。下一步是构建并运行整个包含我们 eval_data
集合中所有 12 个条目的评估。
编写评估逻辑
我们将从我们的 eval_data
列表中的每个输入与提示模板进行组合,将生成的“完成”提示传递给模型,并收集所有返回的输出:
In [93]:
outputs = [get_completion(build_input_prompt(question['animal_statement'])) for question in eval_data]
让我们快速查看一下我们得到的内容:
Out[94]:
['2',
'0',
'5',
'4',
'6',
'4',
'Based on the provided animal statement, "The animal is a bird.", the animal has 2 legs.\n\nResponse: 2',
'0',
'8',
'An octopus has 8 legs.',
'5',
'8']
已经可以看出我们的提示需要改进,因为我们得到了一些不是纯数字的答案!让我们与每个相应的黄金答案一起更仔细地查看结果:
In [95]:
for output, question in zip(outputs, eval_data):
print(f"Animal Statement: {question['animal_statement']}\nGolden Answer: {question['golden_answer']}\nOutput: {output}\n")
Animal Statement: The animal is a human.
Golden Answer: 2
Output: 2
Animal Statement: The animal is a snake.
Golden Answer: 0
Output: 0
Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.
Golden Answer: 5
Output: 5
Animal Statement: The animal is a dog.
Golden Answer: 4
Output: 4
Animal Statement: The animal is a cat with two extra legs.
Golden Answer: 6
Output: 6
Animal Statement: The animal is an elephant.
Golden Answer: 4
Output: 4
Animal Statement: The animal is a bird.
Golden Answer: 2
Output: Based on the provided animal statement, "The animal is a bird.", the animal has 2 legs.
Response: 2
Animal Statement: The animal is a fish.
Golden Answer: 0
Output: 0
Animal Statement: The animal is a spider with two extra legs
Golden Answer: 10
Output: 8
Animal Statement: The animal is an octopus.
Golden Answer: 8
Output: An octopus has 8 legs.
Animal Statement: The animal is an octopus that lost two legs and then regrew three legs.
Golden Answer: 9
Output: 5
Animal Statement: The animal is a two-headed, eight-legged mythical creature.
Golden Answer: 8
Output: 8
这是一个足够小的数据集,我们可以轻松扫描结果并找到有问题的响应,但让我们系统地评估我们的结果:
In [97]:
def grade_completion(output, golden_answer):
return output == golden_answer
grades = [grade_completion(output, question['golden_answer']) for output, question in zip(outputs, eval_data)]
print(f"Score: {sum(grades)/len(grades)*100}%")
Score: 66.66666666666666%
我们现在有一个基准分数!在这种情况下,我们最初的提示得到了66.6%的准确率。在扫描上述结果后,看起来我们当前的输出有两个明显的问题:
问题1:输出格式问题
我们的目标是编写一个提示,使其结果为数字。我们的一些输出不是数字:
Animal Statement: The animal is a bird.
Golden Answer: 2
Output: Based on the provided animal statement, "The animal is a bird.", the animal has 2 legs.
我们可以通过提示来解决这个问题!
问题 1:答案错误
此外,有些答案完全错误:
Animal Statement: The animal is an octopus that lost two legs and then regrew three legs.
Golden Answer: 9
Output: 5
和
Animal Statement: The animal is a spider with two extra legs
Golden Answer: 10
Output: 8
这些输入有点“棘手”,似乎在给模型造成一些问题。我们也会尝试通过提示来修复这个问题!
我们的第二次尝试
既然我们已经在初始提示下获得了一定的基线性能,那么让我们尝试改进提示,看看我们的评估分数是否有所提高。
我们将首先解决模型有时会输出额外文本而不是仅以数字响应的问题。这里是一个第二个生成提示的函数:
In [98]:
def build_input_prompt2(animal_statement):
user_content = f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
Here is the animal statement.
<animal_statement>{animal_statement}</animal_statement>
How many legs does the animal have? Respond only with a numeric digit, like 2 or 6, and nothing else."""
messages = [{'role': 'user', 'content': user_content}]
return messages
提示中的关键补充是这一行:
仅用数字回答,如2或6,不要其他任何内容。
让我们用这个更新的提示测试每个输入:
In [99]:
outputs2 = [get_completion(build_input_prompt2(question['animal_statement'])) for question in eval_data]
我们将快速查看输出:
Out[101]:
['2', '0', '6', '4', '6', '4', '2', '0', '8', '8', '5', '8']
我们现在开始获得专一的数字输出!让我们更仔细地看看结果:
In [102]:
for output, question in zip(outputs2, eval_data):
print(f"Animal Statement: {question['animal_statement']}\nGolden Answer: {question['golden_answer']}\nOutput: {output}\n")
Animal Statement: The animal is a human.
Golden Answer: 2
Output: 2
Animal Statement: The animal is a snake.
Golden Answer: 0
Output: 0
Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.
Golden Answer: 5
Output: 6
Animal Statement: The animal is a dog.
Golden Answer: 4
Output: 4
Animal Statement: The animal is a cat with two extra legs.
Golden Answer: 6
Output: 6
Animal Statement: The animal is an elephant.
Golden Answer: 4
Output: 4
Animal Statement: The animal is a bird.
Golden Answer: 2
Output: 2
Animal Statement: The animal is a fish.
Golden Answer: 0
Output: 0
Animal Statement: The animal is a spider with two extra legs
Golden Answer: 10
Output: 8
Animal Statement: The animal is an octopus.
Golden Answer: 8
Output: 8
Animal Statement: The animal is an octopus that lost two legs and then regrew three legs.
Golden Answer: 9
Output: 5
Animal Statement: The animal is a two-headed, eight-legged mythical creature.
Golden Answer: 8
Output: 8
实际的数字答案仍然存在明显的问题,比如这一个:
Animal Statement: The animal is a spider with two extra legs
Golden Answer: 10
Output: 8
在处理那个问题之前,让我们先得到一个官方的分数,看看我们的表现(希望)是否有所提高:
In [103]:
grades = [grade_completion(output, question['golden_answer']) for output, question in zip(outputs2, eval_data)]
print(f"Score: {sum(grades)/len(grades)*100}%")
Score: 75.0%
我们的分数有所上升! 注意:这个数据集相当小,所以请对这些结果持保留态度
我们的第三次尝试
接下来,让我们解决我们看到的逻辑问题,比如不正确的输出:
Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.
Golden Answer: 5
Output: 6
在这里,我们可以采用思维链提示技术,其中我们给 Claude 特定的指令,让它通过推理来生成最终答案。
现在我们有了评估方法,我们可以测试思维链提示是否真的有影响!
让我们编写一个新的提示,要求模型在 <thinking>
标签内“大声思考”。这使我们的逻辑稍微复杂一些,因为我们需要一种方便的方法来提取模型的最终答案。我们将指示模型也将它的最终答案包含在 <answer>
标签内,以便我们能够轻松地提取“最终”的数字答案:
In [105]:
def build_input_prompt3(animal_statement):
user_content = f"""You will be provided a statement about an animal and your job is to determine how many legs that animal has.
Here is the animal statement.
<animal_statement>{animal_statement}</animal_statement>
How many legs does the animal have?
Start by reasoning about the numbers of legs the animal has, thinking step by step inside of <thinking> tags.
Then, output your final answer inside of <answer> tags.
Inside the <answer> tags return just the number of legs as an integer and nothing else."""
messages = [{'role': 'user', 'content': user_content}]
return messages
让我们使用这个新版本的提示来收集输出:
outputs3 = [get_completion(build_input_prompt3(question['animal_statement'])) for question in eval_data]
现在让我们看看一些输出:
In [110]:
for output, question in zip(outputs3, eval_data):
print(f"Animal Statement: {question['animal_statement']}\nGolden Answer: {question['golden_answer']}\nOutput: {output}\n")
Animal Statement: The animal is a human.
Golden Answer: 2
Output: <thinking>
The animal is a human, and based on this information, we can reasonably conclude that a human has 2 legs. Humans are bipedal, meaning they have two legs that they use for locomotion and standing upright. This is a characteristic of the human species.
</thinking>
<answer>2</answer>
Animal Statement: The animal is a snake.
Golden Answer: 0
Output: <thinking>
The animal stated in the given statement is a snake. Snakes are known to be legless reptiles, as they do not have any legs. They move by slithering on the ground using their body and scales.
</thinking>
<answer>0</answer>
Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.
Golden Answer: 5
Output: Here is my step-by-step reasoning:
<thinking>
1. The initial statement says the fox lost a leg.
2. But then the fox "magically grew back the leg he lost and a mysterious extra leg on top of that."
3. This means the fox originally had 4 legs, lost 1 leg, and then grew back the lost leg plus an extra leg, for a total of 5 legs.
</thinking>
<answer>5</answer>
Animal Statement: The animal is a dog.
Golden Answer: 4
Output: <thinking>
The animal statement says the animal is a dog. Dogs are quadrupeds, meaning they have four legs. Therefore, the number of legs the dog has is 4.
</thinking>
<answer>4</answer>
Animal Statement: The animal is a cat with two extra legs.
Golden Answer: 6
Output: <thinking>
The animal statement says that the animal is a cat with two extra legs. A typical cat has four legs, so with the two extra legs, the animal must have six legs in total.
</thinking>
<answer>6</answer>
Animal Statement: The animal is an elephant.
Golden Answer: 4
Output: <thinking>
An elephant is a large mammal that belongs to the order Proboscidea. Elephants are known to have four legs, one for each of their four limbs. Therefore, based on the given animal statement, the animal is an elephant, and elephants have four legs.
</thinking>
<answer>4</answer>
Animal Statement: The animal is a bird.
Golden Answer: 2
Output: <thinking>
The statement provided indicates that the animal is a bird. Birds typically have two legs, as they are bipedal animals that walk and perch on two legs.
</thinking>
<answer>2</answer>
Animal Statement: The animal is a fish.
Golden Answer: 0
Output: <thinking>
Based on the given animal statement, the animal is a fish. Fish are aquatic vertebrates that typically have fins and gills to help them swim and breathe in the water. Fish do not have legs, as they move through the water using their fins and tails.
</thinking>
<answer>0</answer>
Animal Statement: The animal is a spider with two extra legs
Golden Answer: 10
Output: <thinking>
The animal statement says that the animal is a spider with two extra legs.
A spider typically has 8 legs, so with two extra legs, the total number of legs would be 8 + 2 = 10 legs.
</thinking>
<answer>10</answer>
Animal Statement: The animal is an octopus.
Golden Answer: 8
Output: <thinking>
The animal statement says the animal is an octopus. An octopus is a marine invertebrate with eight tentacles that are often referred to as legs. Therefore, the animal has 8 legs.
</thinking>
<answer>8</answer>
Animal Statement: The animal is an octopus that lost two legs and then regrew three legs.
Golden Answer: 9
Output: <thinking>
The animal is described as an octopus that lost two legs and then regrew three legs. Initially, an octopus has eight legs.
Since the animal lost two legs, it would have had six legs remaining.
Then, the animal regrew three legs, so the final number of legs the animal has is nine.
</thinking>
<answer>9</answer>
Animal Statement: The animal is a two-headed, eight-legged mythical creature.
Golden Answer: 8
Output: <thinking>
The animal statement mentions that the animal is a two-headed, eight-legged mythical creature. This means that the animal has two heads and eight legs.
</thinking>
<answer>8</answer>
这里是我们收到的一种响应示例:
Animal Statement: The fox lost a leg, but then magically grew back the leg he lost and a mysterious extra leg on top of that.
Golden Answer: 5
Output: Here is my step-by-step reasoning:
<thinking>
1. The initial statement says the fox lost a leg.
2. But then the fox "magically grew back the leg he lost and a mysterious extra leg on top of that."
3. This means the fox originally had 4 legs, lost 1 leg, and then grew back the lost leg plus an extra leg, for a total of 5 legs.
</thinking>
<answer>5</answer>
看起来逻辑有所改进,至少在这个特定的例子中是这样。现在我们需要专注于使这个提示“可评分”。在评分之前,我们需要提取 answer
标签之间的数字。
这里有一个函数,可以提取两个 <answer>
标签之间的文本:
In [111]:
import re
def extract_answer(text):
pattern = r'<answer>(.*?)</answer>'
match = re.search(pattern, text)
if match:
return match.group(1)
else:
return None
接下来,让我们从最新的输出批次中提取答案:
In [112]:
extracted_outputs3 = [extract_answer(output) for output in outputs3]
Out[113]:
['2', '0', '5', '4', '6', '4', '2', '0', '10', '8', '9', '8']
接下来,让我们获取我们的分数,看看在提示中添加思维链是否有所差异!
In [114]:
grades3 = [grade_completion(output, question['golden_answer']) for output, question in zip(extracted_outputs3, eval_data)]
print(f"Score: {sum(grades3)/len(grades3)*100}%")
Score: 100.0%
我们将分数提升到了100%!
我们的评估让我们有信心,我们做出的提示更改确实产生了更好的输出。这是一个使用精确匹配评分的简单示例,但在下一课我们将看看一个稍微复杂一点的东西。