Running Claude Code Locally with Ollama and LiteLLM

Authors
  • Nino, Senior Tech Editor

The landscape of software engineering is undergoing a seismic shift with the introduction of agentic coding tools. Anthropic recently released Claude Code, a high-performance command-line interface (CLI) tool that allows developers to interact directly with their codebase using the Claude 3.5 Sonnet model. While Claude Code is incredibly powerful, its reliance on a paid API can lead to significant token costs, especially for large-scale refactoring or continuous repository indexing. For developers seeking privacy and cost-efficiency, the logical next step is running these agentic workflows locally. By leveraging Ollama and a proxy layer like LiteLLM, you can redirect Claude Code's capabilities to open-source powerhouses like DeepSeek-V3 or Llama 3.1.

The Challenge of Local Agentic Workflows

Claude Code is designed specifically to work with the Anthropic API. It utilizes advanced features like tool-calling (function calling) and the Model Context Protocol (MCP) to navigate directories, read files, and execute terminal commands. To run this locally, we encounter two main hurdles: the CLI expects an Anthropic-compatible endpoint, and the local model must be sophisticated enough to handle complex agentic reasoning without hallucinating code changes.

While local models are becoming increasingly capable, there are times when you need the absolute precision of a frontier model. In such cases, using a high-speed aggregator like n1n.ai can provide the necessary Claude 3.5 Sonnet or DeepSeek-V3 API access with lower latency and more flexible pricing than direct providers.

Architecture: Bridging the Gap

To make Claude Code work with a local model, we create a stack consisting of:

  1. Claude Code CLI: The frontend agent that manages the coding tasks.
  2. LiteLLM: A proxy server that translates Anthropic API requests into a format that local providers understand.
  3. Ollama: The local inference engine that runs the actual LLM weights (e.g., DeepSeek-V3 or Llama 3.1).

Step 1: Setting Up the Local Inference Engine

First, ensure you have Ollama installed. Ollama simplifies the process of running large language models on macOS, Linux, and Windows. Once installed, pull a model that supports robust tool-calling. DeepSeek-V3 is currently a top choice for coding tasks thanks to its specialized training data, but keep in mind that it is a 671B-parameter mixture-of-experts model: running it locally requires workstation- or server-class hardware, not a single consumer GPU.

# Pull the DeepSeek-V3 model (ensure you have sufficient VRAM)
ollama pull deepseek-v3

For developers with limited hardware, models like qwen2.5-coder:7b or llama3.1:8b are excellent alternatives that can run on consumer-grade GPUs with 8GB-12GB of VRAM.

Step 2: Configuring LiteLLM as an Anthropic Proxy

By default, Claude Code talks to Anthropic's servers, but it respects the ANTHROPIC_BASE_URL environment variable, which we can point at a local LiteLLM instance. LiteLLM acts as a universal translator, converting Anthropic-style requests into a format that local providers understand.
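Conceptually, the translation LiteLLM performs can be pictured with a small sketch. This is illustrative only: the real proxy handles streaming, tool calls, and dozens of providers, and the field mapping below is a simplified assumption rather than LiteLLM's actual code.

```python
# Illustrative sketch of the kind of request translation a proxy like
# LiteLLM performs: Anthropic /v1/messages payload -> Ollama /api/chat payload.
# The model mapping mirrors the config.yaml from Step 2.

MODEL_MAP = {"claude-3-5-sonnet-20241022": "deepseek-v3"}

def anthropic_to_ollama(payload: dict) -> dict:
    """Translate an Anthropic-style request into an Ollama-style request."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # Ollama expects it as an ordinary "system" message.
    if "system" in payload:
        messages.append({"role": "system", "content": payload["system"]})
    messages.extend(payload["messages"])
    return {
        "model": MODEL_MAP.get(payload["model"], payload["model"]),
        "messages": messages,
        "stream": False,
        "options": {"num_predict": payload.get("max_tokens", 1024)},
    }

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 512,
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "List the files in src/"}],
}
translated = anthropic_to_ollama(request)
```

The point is that Claude Code never needs to know it is talking to Ollama: it keeps sending Anthropic-shaped requests, and the proxy reshapes them in flight.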

Install LiteLLM via pip:

pip install 'litellm[proxy]'

Create a configuration file named config.yaml to map the Anthropic model name to your local Ollama model:

model_list:
  - model_name: claude-3-5-sonnet-20241022
    litellm_params:
      model: ollama/deepseek-v3
      api_base: 'http://localhost:11434'

Run the proxy:

litellm --config config.yaml --port 4000

Step 3: Launching Claude Code Locally

Now that your local server is mimicking the Anthropic API, you need to configure your environment variables so that the Claude Code CLI sends its requests to LiteLLM instead of the cloud.

# Set the base URL to your LiteLLM proxy
export ANTHROPIC_BASE_URL="http://localhost:4000"

# Provide a dummy API key (LiteLLM requires one to be present)
export ANTHROPIC_API_KEY="sk-local-key"

# Run Claude Code
claude

Once inside the Claude Code interface, you can start issuing commands like "Refactor the authentication logic in auth.ts" or "Write unit tests for the utility functions." The agent will now process these requests through your local DeepSeek model via Ollama.

Technical Deep Dive: Why DeepSeek-V3?

DeepSeek-V3 has emerged as a formidable rival to Claude 3.5 Sonnet. In benchmarks specifically targeting code generation and repository-level understanding, DeepSeek-V3 demonstrates a high success rate in tool-calling. This is critical for Claude Code, as the agent must correctly decide when to use ls, cat, or grep to explore your project. If the model fails to format the tool call correctly, the agentic loop breaks.
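To see why tool-call formatting is make-or-break, consider a minimal validator like the sketch below. It is hypothetical (Claude Code's internal parsing is not public, and the tool names are just examples), but it captures the failure mode: if the model emits prose or malformed JSON instead of a well-formed call, the agent has nothing to execute and the loop stalls.

```python
# Hypothetical sketch of tool-call validation in an agentic loop.
# A call is only executable if it is valid JSON naming a known tool
# with a dict of arguments; anything else breaks the loop.
import json

KNOWN_TOOLS = {"ls", "cat", "grep"}

def parse_tool_call(raw: str):
    """Return (tool, args) if the model emitted a usable tool call, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model produced prose or broken JSON: loop stalls
    if call.get("name") not in KNOWN_TOOLS:
        return None  # hallucinated tool name: loop stalls
    if not isinstance(call.get("arguments"), dict):
        return None  # missing or malformed arguments: loop stalls
    return call["name"], call["arguments"]

good = parse_tool_call('{"name": "grep", "arguments": {"pattern": "auth", "path": "src/"}}')
bad = parse_tool_call("Sure! I will now run grep for you.")
```

A strong coding model succeeds on the first form nearly every time; a weaker one drifts into the second, which is why tool-calling benchmarks matter more here than raw code-generation scores.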

However, local execution can be slow if your hardware is underpowered. If you find that local inference throughput drops below roughly 1 token per second, it may be time to switch back to a cloud-based solution. Using n1n.ai allows you to access these same models (DeepSeek-V3 or Claude 3.5) with enterprise-grade infrastructure, ensuring that your development speed isn't bottlenecked by your local GPU.

Managing Context Windows and Costs

One of the primary benefits of this setup is the elimination of per-token billing. Claude Code frequently sends the entire file content or large chunks of the directory structure to the model. In a cloud environment, this can cost several dollars per hour of active development. Locally, your only cost is electricity.
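A back-of-the-envelope estimate makes the savings concrete. The prices and token volumes below are illustrative assumptions, not quotes: $3 per million input tokens and $15 per million output tokens, with the agent re-sending large file contexts every turn.

```python
# Rough hourly cost of cloud-based agentic coding (all figures are
# illustrative assumptions, not actual provider pricing).
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token

turns_per_hour = 60              # one agent turn per minute
input_tokens_per_turn = 20_000   # file contents + directory listings re-sent each turn
output_tokens_per_turn = 1_000   # diffs and explanations

hourly_cost = turns_per_hour * (
    input_tokens_per_turn * INPUT_PRICE
    + output_tokens_per_turn * OUTPUT_PRICE
)
print(f"${hourly_cost:.2f} per hour")  # -> $4.50 per hour
```

Even under these modest assumptions the bill lands in the several-dollars-per-hour range, which is exactly the cost the local setup eliminates.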

However, you must be mindful of the context window. Ollama defaults to a 4096-token context for many models, which Claude Code's large prompts will quickly overflow. To handle large codebases, increase it with an Ollama Modelfile:

# Modelfile: increase context to 32k for better repo understanding
FROM deepseek-v3
PARAMETER num_ctx 32768

# Build the larger-context variant, then point LiteLLM at it:
# ollama create deepseek-v3-32k -f Modelfile

Pro Tips for Local Agentic Coding

  1. Quantization Matters: If you are running on a 24GB VRAM card (like an RTX 3090/4090), the full 671B-parameter DeepSeek-V3 is out of reach; instead, use the Q4_K_M or Q5_K_M quantization of a 32B-class coder such as qwen2.5-coder:32b. It offers a good balance between intelligence and speed.
  2. MCP Integration: Claude Code supports the Model Context Protocol. You can run local MCP servers (like a Postgres inspector or a Google Search tool) and connect them to your local agent setup for even more power.
  3. Hybrid Approach: Use local models for routine tasks like writing boilerplate or documentation. When you encounter a complex bug that requires deep reasoning, switch your ANTHROPIC_BASE_URL back to a high-performance provider like n1n.ai to get the full power of the original Claude 3.5 Sonnet.
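The hybrid approach is easiest with a pair of small shell helpers in your ~/.bashrc or ~/.zshrc. These function names, the cloud URL, and the CLOUD_API_KEY variable are placeholders of my own invention: substitute your actual provider endpoint and key.

```shell
# Hypothetical helpers for switching Claude Code between local and cloud.
# The cloud URL and CLOUD_API_KEY below are placeholders, not real values.
use_local() {
  export ANTHROPIC_BASE_URL="http://localhost:4000"
  export ANTHROPIC_API_KEY="sk-local-key"
  echo "Claude Code now targets the local LiteLLM proxy"
}

use_cloud() {
  export ANTHROPIC_BASE_URL="https://api.example-provider.com"  # placeholder
  export ANTHROPIC_API_KEY="$CLOUD_API_KEY"                     # set this elsewhere
  echo "Claude Code now targets the cloud provider"
}
```

Run use_local before routine boilerplate work and use_cloud when you hit a bug that needs frontier-model reasoning; Claude Code picks up the new environment on its next launch.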

Conclusion

Running Claude Code locally with Ollama and LiteLLM is a game-changer for developers who value privacy and want to experiment with agentic AI without the fear of a massive API bill. While the setup requires some initial configuration, the ability to have an autonomous coding assistant running entirely on your machine is a glimpse into the future of decentralized software development.

For those who require the highest stability and speed without managing their own hardware, n1n.ai provides the most reliable API access to the world's leading models.

Get a free API key at n1n.ai