
Setting Up Local LLMs for Coding

Learn how to run code-focused LLMs on your own hardware for private, low-latency AI-assisted development.

Overview

Running a large language model locally is increasingly practical for AI-assisted software development. Models such as Llama or DeepSeek can run entirely on your own machine, giving you AI coding assistance with complete privacy: your code never leaves your hardware.

As vibe coding matures, developers are increasingly steering their tools with high-level natural language instructions rather than writing every line of implementation by hand.

Why It Matters

By leveraging this approach, developers can significantly reduce boilerplate, focus on architectural considerations, and accelerate the feedback loop from idea to implementation.

  • Can increase velocity by 2-5x, depending on task complexity.
  • Shifts the developer’s role from writing syntax to designing systems and reviewing outputs.
  • Reduces cognitive load when dealing with unfamiliar APIs or languages.

Best Practices

To get the most out of Setting Up Local LLMs for Coding, remember to provide clear constraints and rich context. Large language models operate probabilistically, meaning the quality of the output correlates directly with the specificity of the input.

💡 Pro Tip: Always iterate. Treat the first AI-generated output as a draft, just as you would treat your own first pass at a complex algorithm.

Setting Up a Local LLM for Development

Step 1 — Install Ollama

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: download installer from ollama.ai
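
To confirm the installation, check the CLI version and make sure the background server is up (installers usually start it automatically):

# Verify the CLI is installed
ollama --version

# Start the server manually if it is not already running
ollama serve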

Step 2 — Pull a Code Model

ollama pull deepseek-coder-v2      # Strong general code model
ollama pull qwen2.5-coder:7b       # Efficient for most machines
ollama pull codellama:13b          # Meta's code-specialized model
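
After a pull completes, a quick smoke test from the terminal confirms the model loads and responds; the prompt here is arbitrary:

# One-off prompt without entering the interactive REPL
ollama run qwen2.5-coder:7b "Write a TypeScript function that reverses a string."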

Step 3 — Connect to Your IDE

  • VS Code: install the Continue extension and configure it with provider ollama and your model name (a full configuration example appears below).
  • Cursor: use the local model option in Settings.
  • Neovim: use avante.nvim or codecompanion.nvim with an Ollama backend.

Step 4 — Configure a System Prompt

For code-focused use, set a system prompt that specifies your stack: “You are a coding assistant specializing in TypeScript, Next.js, and PostgreSQL. Always use TypeScript strict mode. Prefer functional patterns.”
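
With Ollama, one way to bake this prompt in is a Modelfile that layers a SYSTEM instruction onto an existing model; the ts-coder name below is just an example:

# Modelfile — a custom model with a baked-in system prompt
FROM deepseek-coder-v2
SYSTEM """You are a coding assistant specializing in TypeScript, Next.js, and PostgreSQL. Always use TypeScript strict mode. Prefer functional patterns."""

Build it with ollama create ts-coder -f Modelfile, then point your IDE extension at ts-coder so every request carries the stack-specific instructions.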

Hardware Requirements

  • 7B parameter models: 8GB RAM minimum, 16GB recommended
  • 13B parameter models: 16GB RAM minimum, 32GB recommended
  • 34B+ models: 32GB+ RAM or GPU with 20GB+ VRAM

Apple Silicon Macs run Ollama models efficiently via Metal GPU acceleration — a MacBook Pro M3 with 16GB unified memory runs 7B models at practical speeds.
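
Before picking a model, it helps to check what your machine actually has. Standard commands, assuming Linux, macOS, and an NVIDIA GPU respectively:

# Linux: total and available RAM
free -h

# macOS: total physical memory in bytes
sysctl -n hw.memsize

# NVIDIA: VRAM capacity and current usage
nvidia-smi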

Choosing the Right Model for Your Hardware

Hardware | Recommended Model | Rationale
--- | --- | ---
8GB RAM | Qwen2.5-Coder:3B | Fits comfortably, good code quality
16GB RAM | DeepSeek-Coder-V2:16B or Qwen2.5-Coder:7B | Best quality for available RAM
32GB RAM | DeepSeek-Coder-V2:33B | Near-frontier code quality locally
M-series Mac (16GB) | Any 7B-13B | Apple Metal acceleration makes these fast
GPU 12GB VRAM | Qwen2.5-Coder:14B via llama.cpp | GPU acceleration significantly faster

Configuring Continue.dev for Ollama

Continue.dev is the most capable open-source AI coding IDE extension for VS Code and JetBrains. Configuration for local Ollama:

// ~/.continue/config.json
{
  "models": [{
    "title": "Local DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2",
    "contextLength": 32768
  }],
  "tabAutocompleteModel": {
    "title": "Local Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:3b"
  }
}

Use a smaller model for autocomplete (low latency required) and a larger model for chat (quality more important).

Privacy Benefits

The key advantage of local LLMs: zero data exposure. Code, proprietary logic, and sensitive business information never leave your machine. For teams handling regulated data (healthcare, finance, legal) or working on unreleased products, local LLMs enable AI coding assistance without compliance risk.

Benchmarking Your Local Setup

Once your local model is running, benchmark it for your actual use case — not generic benchmarks. Run 10 representative prompts from your typical workflow and measure: response latency, code correctness rate, and context retention across a multi-turn session.
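
A minimal sketch of the latency half of that benchmark, hitting the Ollama HTTP API directly; it assumes jq is installed and that prompts.txt (a placeholder name) holds one representative prompt per line:

#!/usr/bin/env bash
# Time each prompt in prompts.txt against a local Ollama model.
MODEL="qwen2.5-coder:7b"   # substitute the model you are evaluating
while IFS= read -r prompt; do
  start=$(date +%s)
  curl -s http://localhost:11434/api/generate \
    -d "$(jq -n --arg m "$MODEL" --arg p "$prompt" '{model: $m, prompt: $p, stream: false}')" \
    > /dev/null
  echo "$(( $(date +%s) - start ))s  $prompt"
done < prompts.txt

Correctness and context retention still need human review; this only captures response time.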

This calibration tells you which models are actually useful for your work (some models excel at one language or framework) and sets realistic expectations for when to use local vs. cloud models.

Combining Local and Cloud Models

A practical hybrid strategy: use a fast local model for autocomplete and quick questions (zero latency, zero cost), and a frontier cloud model for complex tasks that require the highest accuracy. Most AI IDE tools (Continue.dev, Cursor) support configuring different models for different functions — autocomplete model vs. chat model vs. edit model.
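
In Continue, for example, chat and autocomplete models are separate config entries, so a cloud model can sit alongside a local one. A sketch assuming an Anthropic API key; the model names are illustrative:

// ~/.continue/config.json (hybrid setup)
{
  "models": [{
    "title": "Claude (cloud, for complex tasks)",
    "provider": "anthropic",
    "model": "claude-3-5-sonnet-latest",
    "apiKey": "YOUR_API_KEY"
  }, {
    "title": "Local DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2"
  }],
  "tabAutocompleteModel": {
    "title": "Local Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:3b"
  }
}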

Keeping Models Updated

Models improve rapidly. Check for updates every 1-2 months: ollama pull deepseek-coder-v2 fetches the latest version. Newer versions of the same model family are almost always improvements, and the download is incremental when possible.
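
To refresh everything at once, a small shell loop works; this sketch assumes ollama list still prints its default table with the model name in the first column:

# Re-pull every installed model to pick up new versions
ollama list | tail -n +2 | awk '{print $1}' | xargs -n1 ollama pull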

Privacy Configuration Audit

After setup, verify no telemetry is being sent from your IDE extension to external services. Check each tool’s privacy settings explicitly — most AI extensions default to sending usage data. For maximum privacy, disable all telemetry and verify the extension’s network connections in your firewall or proxy logs.
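
On macOS or Linux, one quick spot-check is listing your editor's open network connections while the extension is active; the process-name filter is an assumption, adjust it for your editor:

# Show open network connections for processes matching "code"
lsof -i -nP | grep -i "code"

# A local-only setup should show traffic to 127.0.0.1:11434 (Ollama) and nothing external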

Troubleshooting Common Issues

Model loads but generates slowly: Check if the model is running on CPU instead of GPU. Run ollama ps to see the loaded model’s memory usage and which processor it is using (the PROCESSOR column shows the CPU/GPU split). On Apple Silicon, ensure Metal acceleration is enabled (it is by default). On NVIDIA GPUs, verify CUDA is being used.

Out of memory errors: The model is too large for your RAM. Try a smaller quantization (Q4 instead of Q8) or a smaller parameter count. Run ollama list to see installed models and their sizes, then ollama rm <model> to remove large unused ones.

Context window errors: You have exceeded the model’s context limit. Reduce the size of your prompt or conversation history. Start a fresh session with only the relevant context.
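
If you genuinely need a larger window, Ollama exposes it as the num_ctx option; a sketch of a single API call (the value must still fit in memory, and not every model handles long contexts well):

# Request a 16K context window for one generation
curl -s http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2",
  "prompt": "Summarize this file...",
  "stream": false,
  "options": { "num_ctx": 16384 }
}'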

IDE extension not connecting: Verify Ollama is running by querying its API, e.g. curl http://localhost:11434/api/tags, which should return a JSON list of your installed models. If Ollama is running but the extension cannot connect, check for firewall rules blocking localhost connections.

Performance Optimization

For best local performance: close memory-intensive applications while running large models, let the model fully load before beginning (the first response is slowest), and use the API directly (via Continue.dev or similar) rather than the Ollama web UI for lower latency. Quantized models (Q4, Q5) are significantly faster than full precision with minimal quality loss for most coding tasks.
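
Quantization is selected at pull time through the model tag. Exact tag names vary per model, so treat this one as illustrative and check the model’s page on ollama.com:

# Pull an explicitly quantized variant of a code model
ollama pull qwen2.5-coder:7b-instruct-q4_K_M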
