Token Limits
Understanding token limits in AI coding tools — what tokens are, how limits affect your workflow, and practical strategies to work effectively within them.
What Are Tokens?
Tokens are the fundamental units that large language models use to process text. A token is roughly equivalent to 0.75 words in English, meaning a 1,000-word document contains approximately 1,333 tokens. However, tokenization varies significantly:
- Common English words are typically one token: “the,” “code,” “function”
- Longer or less common words may be split: “refactoring” → “re” + “factor” + “ing” (3 tokens)
- Code tokenizes differently than prose: variable names, operators, and whitespace all consume tokens
- Non-English text is often less efficient: the same content in Korean or Arabic may use 2-4x more tokens than English
Tokenization matters in practice because every LLM has a finite context window measured in tokens, and knowing where those tokens go determines how effectively you can use the model.
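A quick way to build intuition is to tokenize a few strings yourself and look at where the splits fall. The sketch below uses the tiktoken library (the same tool shown later for cost estimation); the exact splits vary by model and encoding, so treat the output as illustrative.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding

for text in ["the function", "refactoring", "user_repository.find_by_email(email)"]:
    token_ids = enc.encode(text)
    # Decode each id individually to see the text piece it represents
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```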
Context Window Limits by Model
| Model | Context Window | Approximate Words |
|---|---|---|
| GPT-4o | 128,000 tokens | ~96,000 words |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words |
| Llama 3.1 (local) | 128,000 tokens | ~96,000 words |
| Mistral 7B (local) | 32,768 tokens | ~24,500 words |
| Smaller local models | 4,096-8,192 tokens | ~3,000-6,000 words |
The “output limit” — how many tokens the model can generate in a single response — is typically much lower than the context window, usually 4,096 to 16,384 tokens depending on the model.
How Token Limits Affect Code Generation
Token limits create practical constraints at every stage of AI-assisted development:
Single file generation: Most files of moderate complexity fit comfortably within output limits. But generating a large service class (500+ lines), a complete test suite, or a complex multi-part module may hit the output limit mid-generation, truncating the output.
Multi-file codebase context: Providing full file content as context to a model is token-expensive. A project with 20 files averaging 200 lines each might consume 40,000+ tokens just in context, leaving little room for the actual prompt and response.
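Before pasting files, it helps to total their token counts and see how much of the window they would consume. A minimal sketch, assuming tiktoken as the counter; the file paths are hypothetical:

```python
import tiktoken
from pathlib import Path

enc = tiktoken.encoding_for_model("gpt-4o")

def context_cost(paths: list[str]) -> int:
    """Report per-file and total token counts for files you plan to paste as context."""
    total = 0
    for p in paths:
        tokens = len(enc.encode(Path(p).read_text()))
        print(f"{p}: {tokens:,} tokens")
        total += tokens
    return total

# Hypothetical usage:
# context_cost(["src/auth/service.py", "src/auth/models.py", "src/auth/routes.py"])
```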
Conversation history: In multi-turn sessions, every previous message is included in the context. Long conversations eventually push older context out of the window entirely, causing the model to “forget” earlier decisions and constraints.
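Most assistants handle this automatically, but the mechanics look roughly like the sketch below: once the running total exceeds the window (minus room reserved for the reply), the oldest turns are the first to go. The window size and the tiktoken-based counting here are approximations.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def fit_history(messages: list[dict], window: int = 128_000, reserve: int = 4_096) -> list[dict]:
    """Drop the oldest turns until the conversation fits, leaving room for the reply."""
    def count(msgs: list[dict]) -> int:
        return sum(len(enc.encode(m["content"])) for m in msgs)

    trimmed = list(messages)
    while trimmed and count(trimmed) > window - reserve:
        trimmed.pop(0)  # the oldest turn is "forgotten" first; a real client would pin the system prompt
    return trimmed
```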
Practical Strategies for Token Budget Management
Strategy 1: Selective Context Inclusion
Instead of pasting entire files, include only the relevant sections — the specific functions, interfaces, and types that relate to the current task. Use comments to indicate what was omitted: // ... (AuthService class, 80 lines) ...
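One way to mechanize selective inclusion is to keep only the definitions a task touches and replace everything else with an omission marker. A rough sketch using Python's ast module; the function name and marker format are illustrative:

```python
import ast

def extract_definitions(source: str, keep: set[str]) -> str:
    """Keep only the named top-level functions/classes; mark the rest as omitted."""
    tree = ast.parse(source)
    kept, omitted = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) and node.name in keep:
            kept.append(ast.get_source_segment(source, node))
        else:
            omitted += 1
    marker = f"# ... {omitted} other top-level definitions omitted ...\n\n"
    return marker + "\n\n".join(kept)
```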
Strategy 2: Reference Over Inclusion
For files in your codebase that the AI tool can access directly (like GitHub Copilot or Cursor with codebase indexing enabled), reference files by path rather than pasting them inline. The tool retrieves relevant sections automatically, using RAG (retrieval-augmented generation) rather than raw context inclusion.
Strategy 3: Summarize vs. Paste
For long documents, APIs, or libraries, paste a summary of the relevant parts rather than the full content. “This uses the Prisma ORM. The User model has fields: id (cuid), email (unique String), role (enum: USER|ADMIN), createdAt (DateTime)” is far more token-efficient than the full Prisma schema.
Strategy 4: Chunk Large Tasks
When generating large amounts of code, chunk the work into pieces that fit within the output limit: generate the data model first, then the service layer, then the controller, then the tests. Each chunk is within limits; the full output is generated iteratively.
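A sketch of what that loop can look like with the OpenAI Python SDK; the model name and step prompts are placeholders, and each chunk's output is carried forward so later chunks stay consistent with earlier ones (at the cost of a growing prompt):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

steps = [
    "Generate the data model for a task-tracking app.",
    "Generate the service layer that uses the data model above.",
    "Generate tests for the service layer above.",
]

context = ""
for step in steps:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are generating one module of a larger codebase."},
            {"role": "user", "content": f"{context}\n\n{step}".strip()},
        ],
    )
    chunk = response.choices[0].message.content
    context += "\n\n" + chunk  # carry earlier output forward as context for the next chunk
    print(chunk)
```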
Strategy 5: Session Management
Start fresh sessions when switching between unrelated tasks. The token cost of carrying irrelevant conversation history is compounded by the attention degradation that occurs with very long contexts — the model becomes less reliable as context grows.
Token Costs and Optimization
For API-based usage (OpenAI, Anthropic, Google), tokens directly translate to cost:
- GPT-4o: ~$2.50/1M input tokens, ~$10/1M output tokens
- Claude 3.5 Sonnet: ~$3/1M input tokens, ~$15/1M output tokens
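For example, a request that sends 50,000 input tokens to Claude 3.5 Sonnet and receives 2,000 output tokens costs roughly 50,000 × $3/1M + 2,000 × $15/1M ≈ $0.15 + $0.03 = $0.18.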
For high-volume usage, token optimization becomes an economic priority. Structured, efficient prompts that provide precisely the context needed, and no more, reduce costs and often improve output quality at the same time. Token efficiency and prompt quality are largely the same discipline.
Local Models and Token Limits
Local models (Ollama, LM Studio, Jan) typically have much smaller context windows — 4k to 32k tokens — which severely constrains what can be included in a single prompt. This makes context management even more critical for local deployments:
- Use system prompts efficiently (under 200 tokens)
- Never paste large files; always excerpt relevant sections
- Use shorter, more targeted sessions rather than extended multi-turn conversations
- Consider models specifically fine-tuned for code, which use their token budget more efficiently on technical content
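A minimal sketch of enforcing that discipline before calling a local model, assuming an 8k window; tiktoken is used here only as a rough stand-in for the local model's own tokenizer, so the count is an approximation:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer; local models ship their own

def fits_budget(system_prompt: str, user_prompt: str,
                context_window: int = 8_192, reserve_for_output: int = 1_024) -> bool:
    """Approximate check that a prompt leaves room for the model's reply."""
    used = len(enc.encode(system_prompt)) + len(enc.encode(user_prompt))
    return used <= context_window - reserve_for_output
```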
Token Counting Before Prompting
For API usage where token costs matter, count your tokens before sending long prompts. Some providers offer token-counting endpoints, and the tiktoken library (Python) counts OpenAI tokens locally. This lets you verify you’re within budget and optimize before incurring cost.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves GPT-4o's o200k_base encoding
your_prompt = "Refactor the AuthService class to use dependency injection."  # example prompt
token_count = len(enc.encode(your_prompt))
# $2.50 per 1M input tokens = $0.0025 per 1K tokens
print(f"Prompt uses {token_count} tokens (${token_count / 1000 * 0.0025:.4f})")