Token Limits
Understanding token limits in AI coding tools — what tokens are, how limits affect your workflow, and practical strategies to work effectively within them.
What Are Tokens?
Tokens are the fundamental units that large language models use to process text. A token is roughly equivalent to 0.75 words in English, meaning a 1,000-word document contains approximately 1,333 tokens. However, tokenization varies significantly:
- Common English words are typically one token: “the,” “code,” “function”
- Longer or less common words may be split: “refactoring” → “re” + “factor” + “ing” (3 tokens)
- Code tokenizes differently than prose: variable names, operators, and whitespace all consume tokens
- Non-English text is often less efficient: the same content in Korean or Arabic may use 2-4x more tokens than English
Tokenization matters in practice because every LLM has a finite context window measured in tokens, and knowing where those tokens go determines how effectively you can use the model.
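A quick way to build intuition is to tokenize a few strings yourself and look at where the splits fall. The sketch below uses the tiktoken library (the same tool shown later for cost estimation); the exact splits vary by model and encoding, so treat the output as illustrative.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding

for text in ["the function", "refactoring", "user_repository.find_by_email(email)"]:
    token_ids = enc.encode(text)
    # Decode each id individually to see the text piece it represents
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```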
Context Window Limits by Model
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-4o | 128,000 tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words |
| Llama 3.1 (local) | 128,000 tokens | ~96,000 words |
| Mistral 7B (local) | 32,768 tokens | ~24,500 words |
| Smaller local models | 4,096-8,192 tokens | ~3,000-6,000 words |
The “output limit” — how many tokens the model can generate in a single response — is typically much lower than the context window, usually 4,096 to 16,384 tokens depending on the model.
How Token Limits Affect Code Generation
Token limits create practical constraints at every stage of AI-assisted development:
Single file generation: Most files of moderate complexity fit comfortably within output limits. But generating a large service class (500+ lines), a complete test suite, or a complex multi-part module may hit the output limit mid-generation, truncating the output.
Multi-file codebase context: Providing full file content as context to a model is token-expensive. A project with 20 files averaging 200 lines each might consume 40,000+ tokens just in context, leaving little room for the actual prompt and response.
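Before pasting files, it helps to total their token counts and see how much of the window they would consume. A minimal sketch, assuming tiktoken as the counter; the file paths are hypothetical:

```python
import tiktoken
from pathlib import Path

enc = tiktoken.encoding_for_model("gpt-4o")

def context_cost(paths: list[str]) -> int:
    """Report per-file and total token counts for files you plan to paste as context."""
    total = 0
    for p in paths:
        tokens = len(enc.encode(Path(p).read_text()))
        print(f"{p}: {tokens:,} tokens")
        total += tokens
    return total

# Hypothetical usage:
# context_cost(["src/auth/service.py", "src/auth/models.py", "src/auth/routes.py"])
```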
Conversation history: In multi-turn sessions, every previous message is included in the context. Long conversations eventually push older context out of the window entirely, causing the model to “forget” earlier decisions and constraints.
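Most assistants handle this automatically, but the mechanics look roughly like the sketch below: once the running total exceeds the window (minus room reserved for the reply), the oldest turns are the first to go. The window size and the tiktoken-based counting here are approximations.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def fit_history(messages: list[dict], window: int = 128_000, reserve: int = 4_096) -> list[dict]:
    """Drop the oldest turns until the conversation fits, leaving room for the reply."""
    def count(msgs: list[dict]) -> int:
        return sum(len(enc.encode(m["content"])) for m in msgs)

    trimmed = list(messages)
    while trimmed and count(trimmed) > window - reserve:
        trimmed.pop(0)  # the oldest turn is "forgotten" first; a real client would pin the system prompt
    return trimmed
```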
Practical Strategies for Token Budget Management
Strategy 1: Selective Context Inclusion
Instead of pasting entire files, include only the relevant sections — the specific functions, interfaces, and types that relate to the current task. Use comments to indicate what was omitted: // ... (AuthService class, 80 lines) ...
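One way to mechanize selective inclusion is to keep only the definitions a task touches and replace everything else with an omission marker. A rough sketch using Python's ast module; the function name and marker format are illustrative:

```python
import ast

def extract_definitions(source: str, keep: set[str]) -> str:
    """Keep only the named top-level functions/classes; mark the rest as omitted."""
    tree = ast.parse(source)
    kept, omitted = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) and node.name in keep:
            kept.append(ast.get_source_segment(source, node))
        else:
            omitted += 1
    marker = f"# ... {omitted} other top-level definitions omitted ...\n\n"
    return marker + "\n\n".join(kept)
```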
Strategy 2: Reference Over Inclusion
For files in your codebase that the AI tool can access directly (like GitHub Copilot or Cursor with codebase indexing enabled), reference files by path rather than pasting them inline. The tool retrieves relevant sections automatically, using RAG (retrieval-augmented generation) rather than raw context inclusion.
Strategy 3: Summarize vs. Paste
For long documents, APIs, or libraries, paste a summary of the relevant parts rather than the full content. “This uses the Prisma ORM. The User model has fields: id (cuid), email (unique String), role (enum: USER|ADMIN), createdAt (DateTime)” is far more token-efficient than the full Prisma schema.
Strategy 4: Chunk Large Tasks
When generating large amounts of code, chunk the work into pieces that fit within the output limit: generate the data model first, then the service layer, then the controller, then the tests. Each chunk is within limits; the full output is generated iteratively.
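A sketch of what that loop can look like with the OpenAI Python SDK; the model name and step prompts are placeholders, and each chunk's output is carried forward so later chunks stay consistent with earlier ones (at the cost of a growing prompt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

steps = [
    "Generate the data model for a task-tracking app.",
    "Generate the service layer that uses the data model above.",
    "Generate tests for the service layer above.",
]

context = ""
for step in steps:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are generating one module of a larger codebase."},
            {"role": "user", "content": f"{context}\n\n{step}".strip()},
        ],
    )
    chunk = response.choices[0].message.content
    context += "\n\n" + chunk  # carry earlier output forward as context for the next chunk
    print(chunk)
```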
Strategy 5: Session Management
Start fresh sessions when switching between unrelated tasks. The token cost of carrying irrelevant conversation history is compounded by the attention degradation that occurs with very long contexts — the model becomes less reliable as context grows.
Token Costs and Optimization
For API-based usage (OpenAI, Anthropic, Google), tokens directly translate to cost:
- GPT-4o: ~$2.50/1M input tokens, ~$10/1M output tokens
- Claude 3.5 Sonnet: ~$3/1M input tokens, ~$15/1M output tokens
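For example, a request that sends 50,000 input tokens to Claude 3.5 Sonnet and receives 2,000 output tokens costs roughly 50,000 × $3/1M + 2,000 × $15/1M ≈ $0.15 + $0.03 = $0.18.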
For high-volume usage, token optimization becomes an economic priority. Structured, efficient prompts that provide precisely the context needed, and no more, reduce costs and often improve output quality at the same time. Token efficiency and prompt quality are largely the same discipline.
Local Models and Token Limits
Local models (Ollama, LM Studio, Jan) typically have much smaller context windows — 4k to 32k tokens — which severely constrains what can be included in a single prompt. This makes context management even more critical for local deployments:
- Use system prompts efficiently (under 200 tokens)
- Never paste large files; always excerpt relevant sections
- Use shorter, more targeted sessions rather than extended multi-turn conversations
- Consider models specifically fine-tuned for code, which use their token budget more efficiently on technical content
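A minimal sketch of enforcing that discipline before calling a local model, assuming an 8k window; tiktoken is used here only as a rough stand-in for the local model's own tokenizer, so the count is an approximation:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer; local models ship their own

def fits_budget(system_prompt: str, user_prompt: str,
                context_window: int = 8_192, reserve_for_output: int = 1_024) -> bool:
    """Approximate check that a prompt leaves room for the model's reply."""
    used = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
    return used <= context_window - reserve_for_output
```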
Token Counting Before Prompting
For API usage where token costs matter, count your tokens before sending long prompts. Some providers offer token-counting endpoints, and the tiktoken library (Python) counts OpenAI tokens locally. This lets you verify you’re within budget and optimize before incurring cost.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves GPT-4o's o200k_base encoding
your_prompt = "Refactor the AuthService class to use dependency injection."  # example prompt
token_count = len(enc.encode(your_prompt))
# $2.50 per 1M input tokens = $0.0025 per 1K tokens
print(f"Prompt uses {token_count} tokens (${token_count / 1000 * 0.0025:.4f})")