Tokenization

Stage 02 - LLM Fundamentals

Tokenization is the step that turns text into the small units an LLM can process. If you understand tokens, you can estimate cost, manage context, debug strange model behavior, and design cleaner prompts for agents.

Text to tokens Tokens to IDs Cost and latency Agent context design

Goal

Understand how LLMs split text into tokens, why token counts are not the same as word counts, and how tokenization affects prompts, context, cost, latency, chunking, and agent reliability.

Learning Path

This topic is designed in four parts. Read them in order.

Part 1 - Understand What Tokenization Does Learn how raw text becomes token pieces, token IDs, and model input. Part 2 - See How Modern Tokenizers Split Text Compare word, character, subword, BPE, WordPiece, Unigram, and SentencePiece. Part 3 - Connect Tokens to Cost, Context, and Latency Use token counts to plan prompts, outputs, tool results, and agent loops. Part 4 - Debug Tokenization in Real Projects Use the target model tokenizer, inspect surprising counts, and prevent common mistakes.

Part 1: Understand What Tokenization Does

An LLM does not directly read text the way humans do. Before text reaches the model, a tokenizer splits it into tokens and maps each token to a number called a token ID.

The model then works with numbers, not raw letters. After generation, the output token IDs are decoded back into readable text.

The Basic Pipeline

flowchart LR
    A[Raw text] --> B[Tokenizer]
    B --> C[Token pieces]
    C --> D[Token IDs]
    D --> E[Embeddings]
    E --> F[LLM]
    F --> G[Output token IDs]
    G --> H[Decoded text]

How to read this diagram: tokenization is only the early conversion step. It creates token pieces and token IDs. The model later turns those IDs into embeddings and uses them to predict the next token.

A Token Is Not Always a Word

A token can be:

a full word
part of a word
punctuation
whitespace
a symbol
a byte-level piece
a special control token used by the model or API

Example:

Text:
Tokenization matters.

Possible token pieces:
["Token", "ization", " matters", "."]

This example is illustrative. The exact split depends on the model's tokenizer.

Why This Matters

Tokenization explains why two strings that look similar to a human can cost different amounts to process.

Text Pattern	Why Token Count Can Change
`AI agent` vs `AI-agent`	Punctuation can create different token pieces.
`red`, `Red`, and `Red`	Spaces and capitalization can change token IDs.
English prose vs code	Code uses symbols, indentation, and identifiers.
English vs Japanese or Arabic	Different languages may use more or fewer tokens per word.
Short answer vs detailed answer	Output tokens are billed and counted too.

The practical rule is simple: never assume tokens equal words. Count tokens with the tokenizer for the model you will actually use.

Part 2: See How Modern Tokenizers Split Text

Modern LLMs usually use subword tokenization. Subword tokenization sits between word-level and character-level tokenization.

Three Levels of Tokenization

Method	Example Split	Strength	Weakness
Word-level	`["unbelievable"]`	Easy to understand	Huge vocabulary; struggles with rare words.
Character-level	`["u", "n", "b", ...]`	Can represent any word	Creates long sequences; less meaning per token.
Subword-level	`["un", "believ", "able"]`	Handles common and rare words well	Splits are model-specific and not always intuitive.

Subword tokenization is common because it balances two needs:

Keep frequent words or phrases compact.
Break rare words, names, code identifiers, and new terms into known pieces.

Common Tokenizer Families

You do not need to memorize every algorithm, but you should know what each one is for. The examples below are simplified to show the idea. Real token splits depend on the model's trained vocabulary.

Tokenizer Family	Core Idea	Simple Example Split	Common Use
BPE	Start with small units and repeatedly merge frequent adjacent pairs.	`lowest` -> `["low", "est"]` after learning common pieces like `low` and `est`.	Many GPT-style and modern transformer models.
Byte-level BPE	Use bytes as the base units so almost any text can be represented.	`café` can be represented safely because the tokenizer can fall back to byte pieces if needed.	Useful for broad text coverage and avoiding unknown characters.
WordPiece	Similar to BPE, but chooses merges using a likelihood-based score.	`unwanted` -> `["un", "##want", "##ed"]` in a BERT-style format.	BERT-family models.
Unigram	Start with many candidate pieces and remove the least useful ones.	`unbelievable` may become `["un", "believable"]` or `["un", "believ", "able"]`, then the highest-probability split is selected.	Some encoder-decoder and multilingual models.
SentencePiece	A tokenizer library that can apply BPE or Unigram directly to raw text, including spaces.	`Hello world` -> `["▁Hello", "▁world"]`, where `▁` represents a space.	Multilingual models and languages without clear word spaces.

Simple BPE Example

BPE learns useful pieces from repeated patterns in training text.

Imagine a tiny training set:

hug, hugs, pug, pugs

A simple BPE-style process might learn:

Start:
h u g
h u g s
p u g
p u g s

Frequent pair:
u + g -> ug

After merge:
h ug
h ug s
p ug
p ug s

The tokenizer has learned that ug is a useful piece. Larger real tokenizers do this over huge text corpora and learn thousands of pieces.

Special Tokens

LLM systems may also use special tokens. These are not normal words. They help structure the model input.

Examples include:

beginning-of-sequence markers
end-of-sequence markers
message role separators
tool-call separators
padding tokens for batches
unknown-token markers in some tokenizers

For agent builders, this matters because the final prompt is not only the visible user message. The model may also receive system instructions, developer instructions, conversation history, tool schemas, tool outputs, and hidden formatting tokens added by the API or framework.

Part 3: Connect Tokens to Cost, Context, and Latency

Tokenization becomes practical when you connect it to engineering limits.

The Four Token Counts Developers Must Track

Count	Meaning	Why It Matters
Input tokens	Tokens sent to the model.	Affects context usage, cost, and latency.
Output tokens	Tokens generated by the model.	Affects cost, response length, and user wait time.
Cached tokens	Previously processed tokens reused by some APIs.	Can reduce cost or latency depending on provider behavior.
Reasoning tokens	Internal tokens used by some reasoning models.	Can increase total token usage even when the visible answer is short.

Do not only count the user's message. In an agent system, input tokens can include:

system prompt
developer prompt
user request
previous messages
retrieved documents
tool definitions
tool results
examples
output schemas
current plan or scratch state

Token Budget Formula

For one model call, think in this shape:

total_token_budget =
  system_and_developer_tokens
  + conversation_tokens
  + retrieved_context_tokens
  + tool_schema_tokens
  + tool_result_tokens
  + expected_output_tokens

The total must fit inside the model's context limit. The context window topic explains that limit in more detail, but tokenization is the measuring system behind it.

Agent Example

Suppose an agent answers a question using retrieved documents and a calculator tool.

System instructions:     700 tokens
User question:            80 tokens
Conversation history:   900 tokens
Tool definitions:       600 tokens
Retrieved passages:    2200 tokens
Tool result:            300 tokens
Reserved answer space:  700 tokens

Estimated total:

700 + 80 + 900 + 600 + 2200 + 300 + 700 = 5480 tokens

This estimate helps you decide whether to:

retrieve fewer passages
summarize old history
shrink tool outputs
reduce answer length
use a model with a larger context window
split the task into multiple calls

Why Token Counts Affect User Experience

More tokens usually means:

higher cost
more latency before the answer starts
longer generation time
more chance of irrelevant context distracting the model

Fewer tokens can improve speed and cost, but removing the wrong information can reduce answer quality. Good agent design is not "use the fewest tokens possible." It is use the right tokens for the current step.

Part 4: Debug Tokenization in Real Projects

Tokenization problems are common when prompts become long, structured, multilingual, or tool-heavy.

Use the Correct Tokenizer

Different models can tokenize the same text differently. Always count with the tokenizer for the target model.

Optional Python example:

# pip install tiktoken
import tiktoken

model = "YOUR_MODEL_NAME"
text = "Tokenization matters for cost, context, and latency."

encoding = tiktoken.encoding_for_model(model)
tokens = encoding.encode(text)

print("Token count:", len(tokens))
print("Token IDs:", tokens)
print("Decoded pieces:", [encoding.decode([token]) for token in tokens])

If your model does not use tiktoken, use the tokenizer provided by that model vendor or the model's Hugging Face tokenizer.

Common Tokenization Surprises

Problem patterns

Large JSON blobs
Long code files
Base64 strings
HTML with repeated tags
Tables pasted as raw text
Logs with timestamps and IDs
Multilingual content
Emoji and unusual symbols

Better handling

Send only relevant fields
Summarize or chunk long files
Do not paste binary-like data
Strip repeated markup
Use compact tables
Keep key log lines only
Test real languages you support
Inspect exact token counts

Weak vs Strong Token-Aware Prompting

Weak

Here is a full 80-page document and all logs from the last week.
Find the issue.

This wastes context and may bury the useful facts inside irrelevant text.

Strong

Analyze these 25 relevant log lines.
Focus on the payment timeout after deploy 2026-06-01.
Return: likely cause, evidence, next check.

This gives the model fewer but more useful tokens for the current task.

Token-Aware Chunking Rules

When splitting documents for RAG or long-document analysis:

chunk by meaning, not only by character count
keep headings with the paragraphs they describe
avoid cutting code blocks in the middle
reserve room for metadata and citations
leave output space in the context budget
test chunks with the same tokenizer used by the target model

Bad chunking can split an important idea across two chunks. Good chunking keeps each chunk useful by itself.

Checklist for Agent Builders

Before sending a model call, ask:

Did I count tokens with the correct tokenizer?
Did I include only context needed for the next step?
Did I reserve enough output tokens?
Did I compress tool results into useful facts?
Did I avoid pasting raw logs, raw HTML, or huge JSON when only a few fields matter?
Did I test token usage on realistic user inputs, not only small examples?

Practice

Choose one real prompt from this roadmap project or from an AI app you are building.

Count the tokens with the tokenizer for your target model.
Identify which parts consume the most tokens.
Remove or compress low-value context.
Count again.
Compare the model's answer before and after the change.

Record:

original input token count
improved input token count
expected output token budget
answer quality
latency difference, if available
what information you removed
what information you kept

Mini Project

Build a small token budget inspector.

It should accept:

system prompt
user message
retrieved context
tool definitions
tool output
expected output length

It should return:

token count for each section
total estimated tokens
warning if the request is near the model limit
suggestion for what to reduce first

Suggested reduction order:

raw tool output
old conversation history
low-ranked retrieved chunks
repeated examples
overly long formatting instructions

Do not reduce the user's core request, safety rules, or required constraints unless the product has a clear fallback flow.

Exit Criteria

You are ready to move on when you can:

explain tokenization in plain English
explain why tokens are not the same as words
describe how text becomes token IDs
compare word, character, and subword tokenization
explain why BPE-style subword tokenizers are useful
count tokens using the correct tokenizer for a model
estimate how tokens affect context, cost, and latency
debug prompts that are too long or unexpectedly expensive
design cleaner agent prompts by keeping the right tokens