Chunking, Indexing, and Preparing Documents for RAG
Learn how document chunking affects retrieval and why indexing strategy matters for usable RAG systems.
Explanation
Chunking splits source material into smaller pieces that can be embedded and retrieved efficiently.
Chunks should preserve enough context to be meaningful but stay focused enough for precise retrieval.
Good indexing captures both content and metadata, such as source, section title, date, and access level.
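As a minimal sketch of such an index entry, a chunk record can pair the text with its metadata. The field names below are illustrative assumptions, not a standard schema:

```python
# A minimal chunk record pairing content with metadata.
# Field names (source, section, date, access_level) are illustrative, not a fixed standard.
chunk_record = {
    "text": "Employees may carry over up to five unused vacation days.",
    "source": "hr-policy-manual.pdf",
    "section": "4.2 Vacation Carryover",
    "date": "2024-01-15",
    "access_level": "internal",
}

# At query time, metadata supports filtering and lets answers cite their origin.
print(f'{chunk_record["section"]} ({chunk_record["source"]})')
```

Storing metadata alongside each chunk means retrieval can filter by access level or date, and generated answers can point back to the exact section they came from.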
Why this topic matters in practice
In generative AI products, the model is only one part of the system; the surrounding retrieval workflow determines whether the output is useful, safe, and maintainable. Chunking and indexing decisions directly shape answer quality in tasks such as tutoring, search, copilots, business assistants, and production automation.
Examples
Policy manual
Chunk by section and subsection so answers can reference clear policy units.
Course lessons
Chunk by heading so queries can retrieve a specific concept rather than a whole lesson.
Technical docs
Attach version metadata so answers prefer the current release.
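For the technical-docs case, version metadata can act as a retrieval filter. A hedged sketch, assuming each indexed chunk carries a `version` field:

```python
# Hypothetical indexed chunks; the `version` field is an assumed convention, not a standard.
chunks = [
    {"text": "Use connect() to open a session.", "version": "1.0"},
    {"text": "Use open_session() to open a session.", "version": "2.0"},
]

def retrieve_current(chunks, current_version="2.0"):
    """Keep only chunks that match the current release."""
    return [c for c in chunks if c["version"] == current_version]

for c in retrieve_current(chunks):
    print(c["text"])
```

In a real system this filter would sit in front of (or alongside) vector similarity search, so outdated API descriptions never reach the model.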
A basic fixed-size chunker
The code below is intentionally concise so the underlying pattern stays clear. It focuses on the application logic you can reuse, even if you later switch model providers or deployment environments.
def chunk_text(text, size=120, overlap=20):
    """Split text into fixed-size character chunks that share `overlap` characters of context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, re-covering the last `overlap` characters
    return chunks

sample = "Generative AI can support drafting, summarization, retrieval, tutoring, coding, and automation."
for i, chunk in enumerate(chunk_text(sample, size=35, overlap=8), start=1):
    print(f"Chunk {i}: {chunk}")

How the coding section works
- This example chunks by character count, but production systems often chunk by sentence, heading, or token count.
- Overlap helps preserve context between neighboring chunks.
- Chunking strategy should match the structure of the source material.
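As one alternative to fixed-size chunking, here is a sketch of heading-based chunking. It assumes markdown-style `#` headings in the source text, which is an assumption about the input format rather than a general rule:

```python
def chunk_by_heading(text):
    """Group lines under their nearest preceding markdown-style heading."""
    chunks, current = [], []
    for line in text.splitlines():
        # A new heading closes the previous chunk (if any) and starts a new one.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Overlap\nOverlap preserves context.\n# Metadata\nMetadata aids filtering."
for c in chunk_by_heading(doc):
    print(c)
    print("---")
```

Because each chunk aligns with a heading, a query can retrieve one concept rather than an arbitrary character window, matching the course-lessons example above.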
Implementation advice
When turning this lesson into a real feature, think beyond the code snippet itself. Decide what inputs should be allowed, how you will validate outputs, how you will recover from errors, and how you will measure whether the feature is actually helping users. Those surrounding choices often determine whether an AI feature feels polished or unreliable.
Summary / key takeaways
- Chunking has a major impact on retrieval quality.
- Metadata helps retrieve the right content and explain where answers came from.
- There is no universal chunk size; testing matters.
Exercises
- Why is overlap useful in chunking?
- Give one reason a chunk might be too large and one reason it might be too small.
- Describe how you would chunk a tutorial lesson for a study assistant.