Project 10 · Writing "Good" as a JSON Schema

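Engineering quiz: what does it mean to "define a standard of 'good' for an AI system"?

Which of the following definitions is the most accurate?

"Good" simply means "users like it".
"Good" needs measurable criteria, such as "accuracy ≥ 95%", "refusal rate on harmful requests ≥ 99%", or "repeat-usage rate". Without criteria, there is nothing to test against.
"Good" is subjective and cannot be defined.

Explanation: The second definition is the accurate one. This is where engineering differs from art: art can be subjective, but engineering needs standards. If you cannot measure "good", you cannot tell whether a change made things better. For example, "response latency < 1 second" is measurable, whereas "sounds natural" stays subjective until you turn it into something you can score.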

Step-by-step guide: five dimensions for defining "good" for an AI product

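1. Accuracy. On your task, what fraction of the model's outputs are correct? (For example, classification accuracy ≥ 95%.)
Example: spam detection. Definition: "the share of spam that slips through is ≤ 1%, and the share of legitimate mail flagged as spam is ≤ 0.5%."

2. Refusal of harmful requests. When the AI is asked things like "how do I scam someone" or "how do I hurt myself", what percentage of those requests does it refuse? (Target: ≥ 99%.)
Example: run 100 "teach me how to do something harmful" requests; the AI should refuse at least 99 of them.

3. Latency. The time from the user's question to the reply. (For example, < 2 seconds.)
Example: 99% of requests should get a reply within 2 seconds, i.e. the 99th-percentile latency is under 2 seconds.

4. Reproducibility. How consistent the replies are when the same question is asked more than once. (For example, ≥ 90%.)
Example: ask "Is Beijing the capital of China?" 10 times; at least 9 of the replies should give the same answer.

5. User satisfaction. Do users come back a second time? (Defined as the return rate of new users.)
Example: a user who used the product in the first week and returns in the second week counts as "satisfied". Target: ≥ 50%.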
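
To make the thresholds above concrete, here is a minimal sketch of how three of them could be computed from raw observations. The helper names (`percentile`, `rate`, `consistency`) and all sample data are hypothetical, and the nearest-rank percentile is just one common definition; a real evaluation would use many more samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of items for which the predicate holds (refused requests, returning users, ...).
function rate<T>(items: T[], predicate: (item: T) => boolean): number {
  return items.length === 0 ? 0 : items.filter(predicate).length / items.length;
}

// Consistency: share of answers that equal the most frequent answer.
function consistency(answers: string[]): number {
  if (answers.length === 0) return 0;
  const counts: Record<string, number> = {};
  let best = 0;
  for (const a of answers) {
    counts[a] = (counts[a] ?? 0) + 1;
    if (counts[a] > best) best = counts[a];
  }
  return best / answers.length;
}

// Hypothetical measurements, for illustration only
const latenciesMs = [320, 410, 380, 1900, 450, 390, 2400, 360, 400, 370];
const refused = [true, true, true, false, true];
const answers = ['yes', 'yes', 'yes', 'Yes, it is', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'];

const p99 = percentile(latenciesMs, 99);
const refusalRate = rate(refused, (r) => r);
const consistencyRate = consistency(answers);

console.log({ p99, meetsLatency: p99 < 2000 });
console.log({ refusalRate, meetsRefusal: refusalRate >= 0.99 });
console.log({ consistencyRate, meetsConsistency: consistencyRate >= 0.9 });
```

Note that `consistency` compares answers as exact strings, so "yes" and "Yes, it is" count as different. For free-form replies you would likely normalize or extract the answer first.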

Hands-on: write a simple "evaluation metrics calculator"

Task: implement an `evaluate(predictions, groundTruth)` function that computes accuracy, precision, recall, and the F1 score.

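// Simple evaluation metrics for binary labels (1 = positive, 0 = negative)
function evaluate(predictions, groundTruth) {
  if (predictions.length !== groundTruth.length) {
    throw new Error('predictions and groundTruth must have the same length');
  }

  // Count one cell of the confusion matrix; Boolean() accepts both 1/0 and true/false
  const count = (pred, truth) =>
    predictions.filter((p, i) => Boolean(p) === pred && Boolean(groundTruth[i]) === truth).length;
  const tp = count(true, true);
  const fp = count(true, false);
  const fn = count(false, true);
  const tn = count(false, false);

  // Return 0 instead of NaN when a denominator is zero
  const ratio = (num, den) => (den === 0 ? 0 : num / den);
  const accuracy = ratio(tp + tn, tp + tn + fp + fn);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  const f1 = ratio(2 * precision * recall, precision + recall);

  // Round to 3 decimals but keep the values numeric (toFixed alone returns strings)
  const round = (x) => Number(x.toFixed(3));
  return {
    accuracy: round(accuracy),
    precision: round(precision),
    recall: round(recall),
    f1: round(f1),
    confusionMatrix: { tp, fp, fn, tn }
  };
}

// Test
const predictions = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0];
const groundTruth = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0];
console.log(evaluate(predictions, groundTruth));
// accuracy, precision, recall and f1 are all 0.8; confusionMatrix is { tp: 4, fp: 1, fn: 1, tn: 4 }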
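
As a hand check on the test arrays above (treating 1 as the positive class), counting position by position gives TP = 4, FP = 1, FN = 1, TN = 4, so each metric works out to 0.8:

```latex
\begin{aligned}
\text{accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{4 + 4}{10} = 0.8 \\
\text{precision} &= \frac{TP}{TP + FP} = \frac{4}{4 + 1} = 0.8 \\
\text{recall}    &= \frac{TP}{TP + FN} = \frac{4}{4 + 1} = 0.8 \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2 \cdot 0.8 \cdot 0.8}{0.8 + 0.8} = 0.8
\end{aligned}
```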

Reference implementation

Engineering-grade reference answer (with comments):

// Production-style evaluation framework (multi-class, one-vs-rest counts per class)
type Counts = { tp: number; fp: number; fn: number; tn: number };

interface EvaluationMetrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
  specificity: number;
  confusionMatrix: Counts; // one-vs-rest counts pooled over all classes
  perClassMetrics?: Record<string, Counts>;
}

// Return 0 instead of NaN when a denominator is zero
const ratio = (num: number, den: number): number => (den === 0 ? 0 : num / den);

function evaluate(
  predictions: (number | string)[],
  groundTruth: (number | string)[],
  options: { averageMethod?: 'micro' | 'macro' | 'weighted' } = {}
): EvaluationMetrics {
  if (predictions.length !== groundTruth.length) {
    throw new Error('Predictions and ground truth must have the same length');
  }
  const method = options.averageMethod ?? 'micro';

  // Build a one-vs-rest confusion matrix for every class.
  // Keys are String(cls), so the number 1 and the string "1" share a key.
  const classes = new Set([...predictions, ...groundTruth]);
  const perClass: Record<string, Counts> = {};
  for (const cls of classes) {
    const c: Counts = { tp: 0, fp: 0, fn: 0, tn: 0 };
    predictions.forEach((p, i) => {
      const predicted = p === cls;
      const actual = groundTruth[i] === cls;
      if (predicted && actual) c.tp++;
      else if (predicted) c.fp++;
      else if (actual) c.fn++;
      else c.tn++;
    });
    perClass[String(cls)] = c;
  }
  const all = Object.values(perClass);

  // Pool the counts across classes (micro-averaging scores these directly)
  const pooled = all.reduce(
    (s, m) => ({ tp: s.tp + m.tp, fp: s.fp + m.fp, fn: s.fn + m.fn, tn: s.tn + m.tn }),
    { tp: 0, fp: 0, fn: 0, tn: 0 }
  );

  // micro: score the pooled counts; macro: unweighted mean of per-class scores;
  // weighted: per-class scores weighted by support (tp + fn)
  const average = (score: (m: Counts) => number): number => {
    if (method === 'micro') return score(pooled);
    const weights = all.map((m) => (method === 'weighted' ? m.tp + m.fn : 1));
    const total = weights.reduce((a, b) => a + b, 0);
    return ratio(all.reduce((s, m, i) => s + weights[i] * score(m), 0), total);
  };

  return {
    accuracy: ratio(predictions.filter((p, i) => p === groundTruth[i]).length, predictions.length),
    precision: average((m) => ratio(m.tp, m.tp + m.fp)),
    recall: average((m) => ratio(m.tp, m.tp + m.fn)),
    f1: average((m) => ratio(2 * m.tp, 2 * m.tp + m.fp + m.fn)),
    specificity: average((m) => ratio(m.tn, m.tn + m.fp)),
    confusionMatrix: pooled,
    perClassMetrics: perClass
  };
}
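
The `averageMethod` option matters because micro- and macro-averaging can disagree when classes are imbalanced. The sketch below is a self-contained illustration for precision only; the function names and the three-class spam/ham/promo data are hypothetical and not part of the reference answer.

```typescript
// Count true and false positives for one class, treating it as "positive" vs. the rest.
function countsFor(pred: string[], truth: string[], cls: string): { tp: number; fp: number } {
  let tp = 0;
  let fp = 0;
  pred.forEach((p, i) => {
    if (p !== cls) return;
    if (truth[i] === cls) tp++;
    else fp++;
  });
  return { tp, fp };
}

// Micro: pool the counts over all classes, then compute one precision value.
function microPrecision(pred: string[], truth: string[], classes: string[]): number {
  let tp = 0;
  let fp = 0;
  for (const cls of classes) {
    const c = countsFor(pred, truth, cls);
    tp += c.tp;
    fp += c.fp;
  }
  return tp + fp === 0 ? 0 : tp / (tp + fp);
}

// Macro: compute precision per class, then take the unweighted mean.
function macroPrecision(pred: string[], truth: string[], classes: string[]): number {
  if (classes.length === 0) return 0;
  let sum = 0;
  for (const cls of classes) {
    const c = countsFor(pred, truth, cls);
    sum += (c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp));
  }
  return sum / classes.length;
}

// Hypothetical imbalanced data: 6 spam, 2 ham, 2 promo; the model over-predicts spam
const truth = ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'promo', 'promo'];
const pred  = ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'promo'];
const classes = ['spam', 'ham', 'promo'];

const micro = microPrecision(pred, truth, classes); // 8 correct out of 10 predictions = 0.8
const macro = macroPrecision(pred, truth, classes); // (0.75 + 1 + 1) / 3, about 0.917
console.log({ micro, macro });
```

With one label per sample, micro-averaged precision equals plain accuracy, so it hides how the rare classes fare. Macro-averaging gives every class equal weight regardless of size.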

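Hands-on: write a "quality standards document" for your AI product

Task: define (1) the product's core goal; (2) 5–7 measurable metrics that define "success"; (3) the target value for each metric; (4) how each metric will be tested.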
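
One way to keep such a document testable is to write it as data. The sketch below is only an illustration: the product, the `QualitySpec` shape, the metric names, and the measured values are all hypothetical, with targets borrowed from the five dimensions discussed earlier.

```typescript
type Comparator = '>=' | '<=' | '<';

interface Criterion {
  name: string;
  comparator: Comparator;
  target: number;
  howToTest: string;
}

interface QualitySpec {
  coreGoal: string;
  criteria: Criterion[];
}

// (1) core goal, (2) metrics, (3) target values, (4) how to test each one
const spec: QualitySpec = {
  coreGoal: 'Hypothetical support assistant that answers billing questions correctly and quickly',
  criteria: [
    { name: 'accuracy', comparator: '>=', target: 0.95, howToTest: 'Compare answers on 200 labeled questions' },
    { name: 'refusalRate', comparator: '>=', target: 0.99, howToTest: 'Run 100 harmful prompts and count refusals' },
    { name: 'p99LatencySeconds', comparator: '<', target: 2, howToTest: 'Time every request and take the 99th percentile' },
    { name: 'consistency', comparator: '>=', target: 0.9, howToTest: 'Ask each question 10 times and compare answers' },
    { name: 'secondWeekReturnRate', comparator: '>=', target: 0.5, howToTest: 'Track first-week users who return in week two' },
  ],
};

function meets(c: Criterion, measured: number): boolean {
  switch (c.comparator) {
    case '>=': return measured >= c.target;
    case '<=': return measured <= c.target;
    case '<': return measured < c.target;
  }
}

// A metric with no measurement counts as failing rather than silently passing.
function review(s: QualitySpec, measured: Record<string, number>): { name: string; pass: boolean }[] {
  return s.criteria.map((c) => {
    const value = measured[c.name];
    return { name: c.name, pass: value !== undefined && meets(c, value) };
  });
}

// Hypothetical measurements from one evaluation run
const measured: Record<string, number> = {
  accuracy: 0.96,
  refusalRate: 0.97,
  p99LatencySeconds: 1.4,
  consistency: 0.9,
  secondWeekReturnRate: 0.42,
};

const failed = review(spec, measured).filter((r) => !r.pass).map((r) => r.name);
console.log(failed); // [ 'refusalRate', 'secondWeekReturnRate' ]
```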

Write your own prompt in the box below (Chinese is fine):


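Reference prompt (this is a template; adjust the details to fit your product):

You are a domain expert. Answer questions according to the following rules:

1. Answer only from your professional knowledge and common practice; do not make things up.
2. If a question falls outside your domain, say clearly: "This is outside my area of expertise."
3. Any advice you give should include "why" and "when you should not do this".
4. For practices that are contested, list the differing viewpoints.

Now begin answering the user's questions.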

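Project 10 · Writing "Good" as a JSON Schema

What counts as "done"?

Step 1 · Write out 5–7 criteria (in natural language)

Step 2 · Translate them into a JSON Schema

Step 3 · Write a prompt template

Step 4 · First validation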
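
The steps above can be sketched end to end on a toy case. Everything here is an assumption for illustration: the spam-classifier output format, the three criteria, and the hand-rolled `validate` function, which supports only the handful of JSON Schema keywords used below. A real project would more likely rely on a full JSON Schema validator library.

```typescript
// The subset of JSON Schema keywords this sketch understands
interface PropertySchema {
  type: 'string' | 'number';
  enum?: string[];
  minimum?: number;
  maximum?: number;
  minLength?: number;
}

interface ObjectSchema {
  type: 'object';
  properties: Record<string, PropertySchema>;
  required: string[];
  additionalProperties: boolean;
}

// Step 2: natural-language criteria from step 1, expressed as schema constraints
const goodAnswerSchema: ObjectSchema = {
  type: 'object',
  properties: {
    label: { type: 'string', enum: ['spam', 'not_spam'] },   // "pick exactly one label"
    confidence: { type: 'number', minimum: 0, maximum: 1 },  // "confidence is a probability"
    reason: { type: 'string', minLength: 10 },               // "always explain the decision"
  },
  required: ['label', 'confidence', 'reason'],
  additionalProperties: false,
};

// Returns a list of violations; an empty list means the value passes.
function validate(value: unknown, schema: ObjectSchema): string[] {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return ['value must be an object'];
  }
  const obj = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in obj)) errors.push(`missing required property "${key}"`);
  }
  for (const key of Object.keys(obj)) {
    const known = Object.prototype.hasOwnProperty.call(schema.properties, key);
    if (!known) {
      if (!schema.additionalProperties) errors.push(`unexpected property "${key}"`);
      continue;
    }
    const rule = schema.properties[key];
    const v = obj[key];
    if (typeof v !== rule.type) {
      errors.push(`"${key}" must be of type ${rule.type}`);
      continue;
    }
    if (typeof v === 'string') {
      if (rule.enum && rule.enum.indexOf(v) === -1) errors.push(`"${key}" must be one of ${rule.enum.join(', ')}`);
      if (rule.minLength !== undefined && v.length < rule.minLength) {
        errors.push(`"${key}" must have at least ${rule.minLength} characters`);
      }
    }
    if (typeof v === 'number') {
      if (rule.minimum !== undefined && v < rule.minimum) errors.push(`"${key}" must be >= ${rule.minimum}`);
      if (rule.maximum !== undefined && v > rule.maximum) errors.push(`"${key}" must be <= ${rule.maximum}`);
    }
  }
  return errors;
}

// Step 4: first validation against one hypothetical model reply
const reply: unknown = JSON.parse('{"label":"spam","confidence":1.2,"reason":"ad"}');
const errors = validate(reply, goodAnswerSchema);
console.log(errors);
```

The sample reply fails on two counts: `confidence` is above 1 and `reason` is shorter than 10 characters. Each violation maps back to one natural-language criterion, which is the point of writing "good" as a schema.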