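Note: this article is adapted from the GitHub project anthropics/courses/prompt_evaluations by way of a Chinese translation, so some wording may differ slightly from the original, but the main logic is unaffected. It is intended for personal reading and study only.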

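Anthropic Workbench evaluations

This lesson shows you how to use the Anthropic Workbench to run your own human-graded evaluations. It is an easy-to-use, visual interface for quickly prototyping prompts and running human-graded evaluations.
Although we generally recommend more scalable approaches for production evaluations, the Anthropic Workbench is a great place to start with human-graded evaluations before moving on to more rigorous code-graded or model-graded evaluations.

In this lesson, we'll learn how to use the Workbench to test prompts, run simple evaluations, and compare prompt versions.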

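Anthropic Workbench

Anthropic's Workbench is a great place to quickly prototype prompts and run human-graded evaluations. This is what the Workbench looks like when we first load it:

On the left side, we can enter a prompt. Let's imagine we're building a code translation application and want to write the best possible prompt for using the Anthropic API to translate code from any programming language into Python. Here's an initial attempt at a prompt: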

You are a skilled programmer tasked with translating code from one programming language to Python. Your goal is to produce an accurate and idiomatic Python translation of the provided source code.

Here is the source code to translate:

<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:

<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python

Note the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables, which we will replace with dynamic values later.
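
Outside the Workbench, the same substitution can be reproduced in a few lines of Python. The following is a minimal sketch, not the Workbench's own implementation: the helper name `fill_template` and the sample Ruby snippet are hypothetical, and the template is a shortened copy of the prompt above.

```python
import re

# Shortened copy of the prompt above, with the same {{...}} placeholders.
PROMPT_TEMPLATE = """Here is the source code to translate:

<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:

<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python"""


def fill_template(template, values):
    """Replace each {{NAME}} placeholder with values[NAME] (hypothetical helper).

    Raises KeyError when a placeholder has no value, so a forgotten
    variable is caught before the prompt is sent anywhere.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value provided for {{" + name + "}}")
        return values[name]

    # A function replacement is used so that backslashes in the inserted
    # source code are copied literally, and inserted text is not re-scanned.
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical test values, in the spirit of the Workbench variables dialog.
prompt = fill_template(
    PROMPT_TEMPLATE,
    {"SOURCE_CODE": 'puts "hello"', "SOURCE_LANGUAGE": "Ruby"},
)
```

Substituting in a single pass means a value that itself contains `{{...}}` (for example, source code in a templating language) is left untouched.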

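We can paste this prompt into the left side of the Workbench:

Next, we can set test values for our variables by clicking the variables ({ }) button:

This opens a dialog asking us to enter values for the {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}} variables:

Next, we can click Run and see the output the model generates:

Workbench evaluations

Testing our prompt with one set of variable values at a time is a good start, but the Workbench also provides a built-in evaluation tool that helps us run a prompt against multiple inputs. To switch to the evaluation view, click the "Evaluate" toggle at the top:

This opens the evaluation view, pre-populated with our initial result:

Next, we can click the "Add Row" button to add new test cases. Let's add two: some Ruby code and some C# code:

Next, we can either click the "Run" button next to each test case, or click the orange "Run Remaining" button to run all test cases that have not been run yet: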
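
To make the "run remaining" idea concrete, here is a rough Python sketch of the same bookkeeping. Everything in it is an assumption for illustration: `EvalCase`, `run_remaining`, and `stub_model` are hypothetical names, the code snippets are placeholders rather than the test cases used in this lesson, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    """One row of the evaluation table (hypothetical structure)."""
    source_language: str
    source_code: str
    output: Optional[str] = None  # None means "not run yet"


def run_remaining(cases: List[EvalCase], run_prompt: Callable[[EvalCase], str]) -> int:
    """Run only the cases without an output; return how many were run."""
    ran = 0
    for case in cases:
        if case.output is None:
            case.output = run_prompt(case)
            ran += 1
    return ran


def stub_model(case: EvalCase) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    return f"<translation of {case.source_language} code>"


# Placeholder test cases: the first already has a result, the others do not.
cases = [
    EvalCase("JavaScript", "const x = 1;", output="x = 1"),
    EvalCase("Ruby", "x = 1"),
    EvalCase("C#", "int x = 1;"),
]
newly_run = run_remaining(cases, stub_model)
```

Because rows that already have an output are skipped, existing results are preserved and only the newly added rows cost a model call.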

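Let's click the "Run Remaining" button and look at the model's responses:

Here are the results we get:

Human grading

Now it's time to look closely at the model outputs and score them. In the right-hand column, we have the option to assign a score to each output:

Let's take a closer look at some of the outputs.

JavaScript test case result

This is the result for our JavaScript translation test case: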

Here's the Python translation of the provided JavaScript code:

chicken_count = 99


In Python:
1. We don't use the `const` keyword to declare constants. By convention, constants in Python are typically written in all uppercase, but this isn't enforced by the language.
2. We don't need to use any keyword (like `let` or `var` in JavaScript) to declare variables.
3. The variable name uses snake_case, which is already consistent with Python naming conventions.

If you want to emphasize that this is intended to be a constant (i.e., a value that shouldn't be changed), you could use the all-uppercase convention:


CHICKEN_COUNT = 99


Both versions are valid Python code, with the uppercase version signaling to other developers that this value is intended to be constant.

Ruby test case result

This is the result for our Ruby translation test case:

Certainly! Here's the Python translation of the provided Ruby code:


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def birthday(self):
        self.age += 1

    def introduce(self):
        print(f"Hello, I'm {self.name} and I'm {self.age} years old.")

    @classmethod
    def create_family(cls, members):
        return [cls(name, age) for name, age in members]

# Create a family
family = Person.create_family([
    ["Alice", 35],
    ["Bob", 40],
    ["Charlie", 12]
])

# Introduce family members
for person in family:
    person.introduce()

# Celebrate Charlie's birthday
charlie = next(person for person in family if person.name == "Charlie")
charlie.birthday()
charlie.introduce()


Here are the key changes and explanations:

1. In Python, we don't need to explicitly declare attribute readers. The attributes are accessible by default.

2. The `initialize` method in Ruby becomes `__init__` in Python.

3. The `birthday!` method is renamed to `birthday` in Python, as the exclamation mark is not typically used in Python method names.

4. The `self.` prefix is used explicitly in Python method definitions and when accessing instance attributes.

5. String interpolation uses f-strings in Python (e.g., `f"Hello, I'm {self.name}"`).

6. The `create_family` class method is defined using the `@classmethod` decorator in Python.

7. List comprehension is used instead of `map` for creating the family members.

8. The `each` method with `&:introduce` is replaced by a simple `for` loop in Python.

9. The `find` method is replaced with `next()` and a generator expression to find Charlie.

This Python code maintains the functionality of the original Ruby code while adhering to Python's syntax and conventions.

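Grading the prompt

All of our outputs so far do a reasonably good job of translating the code, but there are some key problems:

We don't need the annoying preamble, such as "Certainly! Here's the Python translation of the provided Ruby code:". It just wastes output tokens!
The current format is hard to parse programmatically. How would we write code to extract just the translated Python code?
We don't need the long explanation at the end of the output. For our use case, we only need the translated code.

Let's go ahead and score the outputs. We'll give each of them 3 out of 5.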

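Updating the prompt

The next step is to revise our prompt and run the evaluation again! Let's update the prompt to address the problems we identified above.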

You are a skilled programmer tasked with translating code from one programming language to Python. Your goal is to produce an accurate and idiomatic Python translation of the provided source code.

Here is the source code to translate:

<source_code>
{{SOURCE_CODE}}
</source_code>

The source code is written in the following language:

<source_language>
{{SOURCE_LANGUAGE}}
</source_language>

Please translate this code to Python.
Format your response as follows:

<python_code>
Your Python translation here
</python_code>

Only output the <python_code> tags without any other text content

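If we switch back to the "Prompt" view, we can update the prompt in the interface:

The text highlighted in green shows what we added to the prompt.

Next, we click the orange "Run" button to test our new prompt. Here's the new result:

This is exactly what we hoped to see! The response contains no preamble and no extended explanation of the code.

Re-running the evaluation

Next, we can switch back to the evaluation view:

Notice the "v2" in the top left, which indicates that this is the second version of our prompt. Let's click "Run Remaining" and look at the outputs for the other test cases:

The new outputs look great! They all skip the preamble and explanatory text and contain only the <python_code> tags wrapping the translated Python code. Let's go ahead and mark them all as 5/5!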
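
With the response wrapped in <python_code> tags, extracting the translation in application code becomes straightforward. The sketch below shows one possible approach using a regular expression; the function names `extract_python_code` and `is_only_python_code` are hypothetical, and the sample response is modeled on the JavaScript test case above.

```python
import re
from typing import Optional

# Capture everything between the tags; DOTALL lets "." span multiple lines.
PYTHON_CODE_PATTERN = re.compile(r"<python_code>(.*?)</python_code>", re.DOTALL)


def extract_python_code(response: str) -> Optional[str]:
    """Return the code inside the first <python_code> block, or None if absent."""
    match = PYTHON_CODE_PATTERN.search(response)
    if match is None:
        return None
    # Drop only the newlines next to the tags, keeping any indentation.
    return match.group(1).strip("\n")


def is_only_python_code(response: str) -> bool:
    """True when the response is a <python_code> block with no text around it."""
    return PYTHON_CODE_PATTERN.fullmatch(response.strip()) is not None


response = "<python_code>\nCHICKEN_COUNT = 99\n</python_code>"
code = extract_python_code(response)
```

A check such as `is_only_python_code` is also a first step toward code-graded evaluation: it flags any response that slips back into adding a preamble or a trailing explanation.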

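Comparing results

Now that we've tried two different prompts, we can compare the results side by side. We can click the "+ Add Comparison" button in the top right and select an earlier version of our prompt (v1) to compare against our v2 results.
This displays the model outputs and human-assigned scores for both prompts side by side:

Clearly, our v2 prompt works better for our specific use case!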
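
The same comparison can be summarized numerically. The sketch below uses the human scores assigned in this lesson (3/5 for each of the three v1 outputs, 5/5 for each v2 output); the dictionary layout and the `summarize` helper are illustrative assumptions, not something the Workbench exports.

```python
from statistics import mean

# Human-assigned scores per prompt version, one per test case
# (JavaScript, Ruby, C#), as graded in this lesson.
scores = {
    "v1": [3, 3, 3],
    "v2": [5, 5, 5],
}


def summarize(scores_by_version):
    """Return the mean score for each prompt version (hypothetical helper)."""
    return {version: mean(values) for version, values in scores_by_version.items()}


summary = summarize(scores)
best_version = max(summary, key=summary.get)  # version with the highest mean
```

With only three test cases, a mean like this is a rough signal rather than a statistically meaningful result, which is one reason larger automated evaluations are worth building later.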

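The Workbench and its evaluation tool are well suited to quickly prototyping prompts and comparing results side by side. They are a good place to begin an evaluation effort before moving on to more robust solutions.
In the upcoming lessons, we'll see how to automate code-graded and model-graded evaluations at larger scale.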
