RAG Basics

Goal

Understand Retrieval-Augmented Generation, or RAG, in a simple developer-focused way.

After this lesson, you should be able to explain:

what RAG means,
why RAG is useful,
how retrieval and generation work together,
what a basic RAG architecture looks like,
how documents become searchable context,
what can go wrong in a RAG system,
how to build and test a small RAG example.

What is RAG

RAG means Retrieval-Augmented Generation.

Retrieval-Augmented Generation (RAG) combines information retrieval with language generation to produce more accurate, context-aware responses. It uses two components: a retriever, which searches a database to find relevant information, and a generator, which crafts a response based on the retrieved data. Implementing RAG involves using a retrieval model (e.g., embeddings and vector search) alongside a generative language model (like GPT). The process starts by converting a query into embeddings, retrieving relevant documents from a vector database, and feeding them to the language model, which then generates a coherent, informed response. This approach grounds outputs in real-world data, resulting in more reliable and detailed answers.

Retrieval  = search for relevant information
Augmented  = add that information to the prompt
Generation = let the LLM write the final answer

The simple idea:

Do not ask the model to guess.
Give the model useful source text first.
Then ask it to answer.

AWS diagram showing how RAG retrieves external knowledge before sending context to a foundation model

Source image: AWS SageMaker documentation.

Why RAG Matters

LLMs have limits:

They do not automatically know your private documents.
They may not know new information after training.
They can forget details in long conversations.
They can sound confident even when they are wrong.
They cannot fit every document into the context window.

RAG helps because it gives the model relevant outside information on demand.

RAG can help a system:

answer from company documents,
use newer information,
reduce guessing,
cite real sources,
avoid expensive fine-tuning for every knowledge update,
answer questions about data the base model never saw during training.

The Basic Picture

User asks a question
        |
        v
Search knowledge source
        |
        v
Retrieve relevant passages
        |
        v
Put passages into the prompt
        |
        v
LLM writes final answer

In one sentence:

RAG = search first, answer second.

Two Main Pipelines

A RAG app usually has two pipelines:

indexing pipeline,
question-answering pipeline.

1. Indexing Pipeline

This happens before the user asks a question.

Documents
  -> clean text
  -> split into chunks
  -> create embeddings
  -> store vectors + text + metadata
  -> make the knowledge base searchable

Example:

Company handbook
  -> chunk 1: Password reset policy
  -> chunk 2: Vacation approval process
  -> chunk 3: Expense rules
  -> chunk 4: Laptop return process

Each chunk is embedded and stored so the system can find it later.

2. Question-Answering Pipeline

This happens when the user asks a question.

Question
  -> turn question into search query
  -> retrieve top matching chunks
  -> add chunks to the prompt
  -> ask the LLM to answer using those chunks
  -> return answer with sources

Example:

User:
  "How do I reset my password?"

Retriever finds:
  "To reset your password, open the identity portal..."

LLM answers:
  "Open the identity portal and choose Reset Password. Source: IT Handbook."

Component Breakdown

Component	What It Does
User query	The question or task from the user
Retriever	Searches for relevant information
Knowledge source	Documents, database rows, tickets, web pages, notes, or files
Chunks	Small pieces of documents
Embeddings	Number vectors that represent meaning
Vector store	Stores embeddings and searches by similarity
Metadata	Source, URL, date, owner, permission, category
Prompt builder	Combines user question and retrieved context
LLM	Generates the final answer
Citations	Show where the answer came from

RAG With Embeddings

Most modern RAG systems use embeddings and vector search.

Document chunk:
  "Employees can reset passwords in the identity portal."

Embedding:
  [0.12, -0.44, 0.87, ...]

User question:
  "How can I recover my account password?"

Query embedding:
  [0.11, -0.40, 0.82, ...]

Vector search:
  These meanings are close, so return the chunk.

This is useful because the user and the document do not need to use the same words.

Simple RAG Example

Imagine this small knowledge base:

Document A:
  Passwords can be reset from the identity portal.

Document B:
  Expense reports must be submitted within 30 days.

Document C:
  Laptops must be returned to IT before the last work day.

User asks:

Where do I reset my password?

RAG flow:

1. Search the knowledge base.
2. Retrieve Document A.
3. Add Document A to the prompt.
4. Ask the LLM to answer only from the retrieved context.
5. Return the answer.

Prompt sent to the LLM:

Use the context below to answer the user.
If the context does not contain the answer, say you do not know.

Context:
Passwords can be reset from the identity portal.

Question:
Where do I reset my password?

Good answer:

You can reset your password from the identity portal.

Bad answer:

You can reset it by calling your manager.

The bad answer is not grounded in the retrieved source.

RAG As A Developer Workflow

For developers, a simple RAG system can be built in this order:

Choose a small knowledge source.
Split the documents into chunks.
Store each chunk with metadata.
Create embeddings for each chunk.
Save embeddings in a vector database or local vector index.
Embed the user's question.
Retrieve the most similar chunks.
Put the chunks into the prompt.
Tell the LLM to answer using only the provided context.
Return the answer and source.

Prompt Template

A basic RAG prompt should be strict.

You are a helpful assistant.
Answer the question using only the provided context.
If the context does not contain the answer, say:
"I do not know from the provided context."
Include the source name when possible.

Context:
{retrieved_chunks}

Question:
{user_question}

This reduces unsupported answers. It does not fully remove hallucinations, so you still need testing.

When To Use RAG

Use RAG when:

the answer depends on private data,
the answer depends on recent data,
the knowledge changes often,
users need citations,
the model should answer from a known source,
fine-tuning would be too slow or expensive,
the model needs access to many documents.

Do not use RAG for everything.

RAG may be unnecessary when:

the task is simple writing or rewriting,
the answer is already in the prompt,
the model's general knowledge is enough,
there is no external knowledge source,
incorrect retrieval would add more risk than value.