Lesson 18 · Intermediate

Evaluating Generative AI Systems

Learn how to evaluate answer quality, groundedness, consistency, usefulness, and task success in generative AI applications.

Read the explanation carefully, then review the examples and coding section. The goal is to understand both the concept and how it appears inside a real application workflow.

Explanation

Evaluation must be tied to the use case: a tutoring bot is judged differently from a code assistant or a summarizer.

Useful evaluation dimensions include factuality, relevance, completeness, formatting, and user satisfaction.

Strong teams combine automated checks with human review.
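As a hedged sketch, the dimensions above can be scored independently and reported side by side rather than collapsed into one number. The dimension names, weights, and heuristics below are illustrative toy choices, not a standard.

```python
# Toy sketch: score one answer along several named dimensions.
# Heuristics are deliberately crude; real evaluators replace each
# with a stronger check (embeddings, LLM judges, human labels).
def score_dimensions(answer: str, reference_terms: list[str]) -> dict:
    """Return per-dimension scores between 0 and 1 (toy heuristics)."""
    lowered = answer.lower()
    covered = [t for t in reference_terms if t.lower() in lowered]
    return {
        "relevance": len(covered) / len(reference_terms) if reference_terms else 0.0,
        "completeness": min(len(answer.split()) / 40, 1.0),  # rough length proxy
        "formatting": 1.0 if answer.strip().endswith(".") else 0.5,
    }

scores = score_dimensions(
    "Gradient descent iteratively updates parameters to reduce loss.",
    ["gradient", "loss"],
)
print(scores)
```

Keeping dimensions separate makes regressions easier to diagnose: a drop in "relevance" points somewhere different than a drop in "formatting".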

Why this topic matters in practice

In generative AI products, the model is only one part of the system. The surrounding workflow determines whether the output is useful, safe, and maintainable. This lesson matters because it helps you connect the idea to tasks such as tutoring, search, copilots, business assistants, and production automation.

Examples

Support answers

Check whether answers reflect policy and whether required escalation language appears.
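A minimal sketch of the escalation check, assuming the required phrases come from your own policy documents; the phrase list here is hypothetical.

```python
# Hypothetical policy phrases; a real system would load these from policy docs.
REQUIRED_ESCALATION_PHRASES = ["contact support", "escalate"]

def has_escalation_language(answer: str) -> bool:
    """Return True if the answer contains at least one required phrase."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in REQUIRED_ESCALATION_PHRASES)

print(has_escalation_language("If the issue persists, please contact support."))  # True
print(has_escalation_language("Try restarting the app."))  # False
```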

Summaries

Evaluate whether key points are preserved without major distortion.
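One simple way to operationalize this is key-point coverage: what fraction of expected points appear in the summary. Exact substring matching, as sketched below, is a simplification; production evaluators often use embeddings or an LLM judge to tolerate paraphrase.

```python
def key_point_coverage(summary: str, key_points: list[str]) -> float:
    """Fraction of key points literally present in the summary (toy metric)."""
    lowered = summary.lower()
    hits = sum(1 for point in key_points if point.lower() in lowered)
    return hits / len(key_points) if key_points else 0.0

summary = "Revenue grew 10% while costs stayed flat."
print(key_point_coverage(summary, ["revenue grew", "costs stayed flat", "new CEO"]))
```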

Extraction tasks

Measure whether structured outputs are complete and correctly formatted.
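For extraction, the check can be fully automated. Below is a hedged sketch assuming the output is expected as JSON; the required field names are illustrative.

```python
import json

# Illustrative schema; substitute the fields your extraction task requires.
REQUIRED_FIELDS = {"name", "date", "amount"}

def validate_extraction(raw_output: str) -> tuple[bool, list[str]]:
    """Return (is_valid, missing_fields) for a raw model output string."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, sorted(REQUIRED_FIELDS)  # unparseable: everything is missing
    missing = sorted(REQUIRED_FIELDS - data.keys())
    return len(missing) == 0, missing

print(validate_extraction('{"name": "Acme", "date": "2024-01-05", "amount": 99.5}'))
```

Tracking which fields go missing most often is usually more actionable than a single pass/fail rate.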

A simple rubric-based evaluator

The code below is intentionally concise so the underlying pattern stays clear. It focuses on the application logic you can reuse, even if you later switch model providers or deployment environments.

def evaluate_answer(answer):
    """Score an answer against a tiny three-point rubric."""
    score = 0
    # Reward substantive length (a rough proxy for completeness).
    if len(answer) > 50:
        score += 1
    # Reward a concrete illustration.
    if "example" in answer.lower():
        score += 1
    # Reward honest uncertainty instead of a confident guess.
    if "i do not know" in answer.lower():
        score += 1
    return score

sample_answer = "Machine learning finds patterns in data. For example, it can predict house prices."
print("Rubric score:", evaluate_answer(sample_answer))

How the coding section works

  • This is a toy rubric, but it introduces the idea of explicit evaluation criteria.
  • Automated checks are useful for volume, but humans are still needed for nuance.
  • The most important step is defining what 'good' means for your application.
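The second bullet can be sketched as a routing rule: low automated scores always go to a human, and a small random sample of the rest is spot-checked. The threshold and sample rate below are illustrative assumptions.

```python
import random

def needs_human_review(score: int, sample_rate: float = 0.1) -> bool:
    """Route low-scoring answers to humans; spot-check a sample of the rest."""
    # Low scores always escalate; otherwise sample a fraction for review.
    return score <= 1 or random.random() < sample_rate

print(needs_human_review(0))  # True: low score always escalates
```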

Implementation advice

When turning this lesson into a real feature, think beyond the code snippet itself. Decide what inputs should be allowed, how you will validate outputs, how you will recover from errors, and how you will measure whether the feature is actually helping users. Those surrounding choices often determine whether an AI feature feels polished or unreliable.
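One way those surrounding choices might look in code is a thin wrapper that validates inputs before the model call and outputs after it. This is a sketch: `generate` stands in for whatever model call your application actually uses, and the limits and fallback text are hypothetical.

```python
def safe_generate(prompt: str, generate, max_prompt_len: int = 2000) -> str:
    """Wrap a generation call with basic input and output validation."""
    # Input validation: reject clearly unusable prompts up front.
    if not prompt.strip():
        raise ValueError("empty prompt")
    if len(prompt) > max_prompt_len:
        raise ValueError("prompt too long")
    answer = generate(prompt)
    # Output validation: fall back to a safe default on empty output.
    return answer.strip() or "Sorry, I could not produce an answer."

print(safe_generate("What is recall?", lambda p: "Recall measures coverage of true positives."))
```

In a real feature, the error paths here would also be logged and counted, since error rates are themselves an evaluation signal.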

Summary / key takeaways

  • Evaluation is essential because fluent output is not the same as useful output.
  • A good rubric reflects the real goal of the application.
  • Measure systematically rather than relying on intuition.

Exercises

  1. Design three evaluation criteria for a school FAQ assistant.
  2. Why might a long answer still be a bad answer?
  3. Create a simple rubric for an AI-generated summary.