Running LLMs locally used to require a machine learning PhD and a rack of GPUs. Ollama changed that. It is a single binary that manages model downloads, quantization, and a local OpenAI-compatible API endpoint. I started using Ollama for internal tooling at Commsult Indonesia where data privacy requirements make cloud LLM APIs complicated — sending employee or client data to OpenAI or Anthropic raises compliance questions our legal team needed time to evaluate. Local inference solved that immediately. This guide covers my production setup: model selection, hardware sizing, API integration, and the honest trade-offs versus cloud APIs.
Quantization is why local LLMs are viable in 2025. A 70B-parameter model like Llama 3.1 70B, quantized to 4-bit (Q4_K_M), needs about 40GB of memory and can run on a workstation with 64GB of unified memory or on two consumer GPUs. The 8B models like Llama 3.1 8B run comfortably in 8GB of VRAM, for example on a single RTX 3080 or an Apple M2 Pro. In my experience, quality for most business tasks (summarization, classification, structured extraction, Q&A over documents) is now within roughly 85-90% of GPT-4-class models. For tasks like code completion, formatting ERP data, and generating reports, the gap is even smaller.
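The sizing arithmetic behind those numbers can be sketched in a few lines. This is a rough estimate only: the 4.5 bits-per-weight figure for Q4_K_M is my assumption (actual GGUF files vary), and the function name is hypothetical.

```typescript
// Rough memory estimate for the weights of a quantized model.
// ASSUMPTION: Q4_K_M averages about 4.5 bits per weight; real files vary.
const Q4_K_M_BITS_PER_WEIGHT = 4.5;

function estimateWeightsGB(
  paramsBillions: number,
  bitsPerWeight: number = Q4_K_M_BITS_PER_WEIGHT,
): number {
  // billions of params * bits per weight / 8 bits per byte = gigabytes
  return (paramsBillions * bitsPerWeight) / 8;
}

console.log(estimateWeightsGB(70).toFixed(1)); // 39.4
console.log(estimateWeightsGB(8).toFixed(1)); // 4.5
```

The estimate covers weights only. The context cache and runtime buffers come on top, which is why the hardware needs more memory than the raw figure suggests.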
For production local inference, I use:
- Llama 3.1 8B (Q4_K_M) for fast, latency-sensitive tasks like chat and quick lookups: 2-4 tokens/second on CPU, 20-40 tokens/second on GPU.
- Llama 3.1 70B (Q4_K_M) for complex reasoning, code generation, and document analysis. It runs on our dev server with 128GB RAM.
- Mistral NeMo 12B for structured extraction tasks. It is excellent at following JSON schemas.
- Qwen2.5-Coder 7B for code-specific tasks. It beats larger general models on coding benchmarks at a fraction of the compute cost.

Always use the -Instruct or -Chat variants, not the base models.
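In application code, this selection can be expressed as a small lookup from task type to model tag. The sketch below is illustrative: the `Task` type and `pickModel` helper are hypothetical names, and the `llama3.1:70b` tag is my assumption for the 70B model.

```typescript
// Task categories follow the model list in this section.
type Task = "chat" | "code" | "extraction" | "reasoning";

// ASSUMPTION: tags as they would be pulled with `ollama pull`.
const MODEL_BY_TASK: Record<Task, string> = {
  chat: "llama3.1:8b", // fast, latency-sensitive
  code: "qwen2.5-coder:7b", // code-specific tasks
  extraction: "mistral-nemo:12b", // structured JSON extraction
  reasoning: "llama3.1:70b", // complex reasoning, document analysis
};

function pickModel(task: Task): string {
  return MODEL_BY_TASK[task];
}

console.log(pickModel("extraction")); // mistral-nemo:12b
```

Keeping the mapping in one place makes it easy to swap a model without touching call sites.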
┌─────────────────────────────────────────────────────────────┐
│               Ollama Production Architecture                │
│                                                             │
│  Application Layer                                          │
│  ┌──────────────────────────────────────┐                   │
│  │  NestJS / Node.js App                │                   │
│  │  (uses OpenAI SDK, base_url=local)   │                   │
│  └─────────────────┬────────────────────┘                   │
│                    │ HTTPS                                  │
│                    ▼                                        │
│  ┌──────────────────────────────────────┐                   │
│  │  Nginx (TLS + API Key Auth)          │                   │
│  └─────────────────┬────────────────────┘                   │
│                    │ http://localhost:11434                 │
│                    ▼                                        │
│  ┌──────────────────────────────────────┐                   │
│  │  Ollama Service                      │                   │
│  │  - llama3.1:8b (chat)                │                   │
│  │  - qwen2.5-coder:7b (code)           │                   │
│  │  - mistral-nemo:12b (extraction)     │                   │
│  └──────────────────────────────────────┘                   │
│                    │                                        │
│                    ▼                                        │
│  GPU: NVIDIA RTX 4090 (24GB VRAM)                           │
└─────────────────────────────────────────────────────────────┘

From my experience running Ollama in production on a DigitalOcean GPU Droplet: preload your most frequently used models at startup to avoid cold-start latency. By default Ollama unloads a model after five minutes of inactivity; setting OLLAMA_KEEP_ALIVE=24h in the environment keeps it in memory for 24 hours after the last request. The first request to a model that is not yet loaded takes 3-8 seconds while the model loads; subsequent requests are nearly instant. For latency-sensitive applications, implement a warm-up call on server startup.
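A warm-up call can be as small as a generate request with no prompt, which makes Ollama load the model without producing output. The sketch below assumes Node 18+ (global `fetch`) and Ollama listening on 127.0.0.1:11434; `buildWarmupRequest` and `warmUp` are hypothetical helper names.

```typescript
// ASSUMPTION: Ollama is reachable on its default local address.
const OLLAMA_URL = "http://127.0.0.1:11434";

interface WarmupRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildWarmupRequest(model: string, keepAlive = "24h"): WarmupRequest {
  return {
    url: `${OLLAMA_URL}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // No prompt field: the model is loaded, nothing is generated.
      body: JSON.stringify({ model, keep_alive: keepAlive }),
    },
  };
}

// Load each model in turn; fail loudly so startup problems are visible.
async function warmUp(models: string[]): Promise<void> {
  for (const model of models) {
    const { url, init } = buildWarmupRequest(model);
    const res = await fetch(url, init);
    if (!res.ok) {
      throw new Error(`warm-up failed for ${model}: HTTP ${res.status}`);
    }
  }
}
```

Calling `warmUp(["llama3.1:8b", "qwen2.5-coder:7b", "mistral-nemo:12b"])` from the application's startup hook would move the load delay out of the first user request.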
Running Ollama as a production service rather than a developer tool requires several additional steps: systemd service configuration, resource limits, authentication proxy, and monitoring. Ollama has no built-in authentication. It binds to 127.0.0.1 by default, but as soon as you expose it on the network (for example with OLLAMA_HOST=0.0.0.0), anyone who can reach the port can call the API. For production, keep Ollama bound to localhost and always put an authenticated nginx reverse proxy in front.
The configuration I use puts Ollama behind nginx with HTTP basic auth for simple internal deployments, or an API key validation proxy for team deployments. For enterprise use, a more robust solution is an OpenAI-compatible proxy like LiteLLM that handles authentication, rate limiting, model routing, and cost tracking across both local and cloud models from a single endpoint.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Configure as systemd service with resource limits
sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=4"
[Install]
WantedBy=default.target
EOF
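# Reload unit files, then restart so the new environment takes effect
# (the installer may already have started the service)
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama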
# Pull models
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:7b
ollama pull mistral-nemo:12b
# Nginx API key auth proxy
sudo tee /etc/nginx/sites-available/ollama > /dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;
    ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

    location /v1/ {
        # Validate API key
        if ($http_authorization != "Bearer YOUR_SECRET_API_KEY") {
            return 401 '{"error": "Unauthorized"}';
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;
    }
}
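EOF
# Enable the site, validate the config, and reload nginx
sudo ln -sf /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/ollama
sudo nginx -t && sudo systemctl reload nginx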
# Use with OpenAI SDK (drop-in replacement)
# import OpenAI from 'openai'
# const client = new OpenAI({
# baseURL: 'https://llm.yourcompany.com/v1',
# apiKey: 'YOUR_SECRET_API_KEY',
# })
# const response = await client.chat.completions.create({
# model: 'llama3.1:8b',
# messages: [{ role: 'user', content: 'Hello!' }],
# })

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1, which means any OpenAI SDK works without modification. Switch the base URL (base_url in the Python SDK, baseURL in the Node SDK) and the model name, and your existing code runs against local models. This is the fastest path to local inference for existing applications. I use this to A/B test local vs cloud responses for the same prompts, measure quality differences, and gradually migrate cost-sensitive workloads to local models.
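One way to run such an A/B comparison is to send the same prompt to both backends over the shared chat completions format. The sketch below is not the setup described here, just a minimal illustration: the `Backend` type, the `buildChatRequest` and `compare` helpers, the placeholder keys, and the `gpt-4o` model name are all assumptions, and it uses plain `fetch` (Node 18+) instead of the SDK to stay dependency-free.

```typescript
interface Backend {
  name: string;
  baseURL: string;
  apiKey: string;
  model: string;
}

// ASSUMPTION: placeholder hosts, keys, and model names.
const localBackend: Backend = {
  name: "local",
  baseURL: "https://llm.yourcompany.com/v1",
  apiKey: "YOUR_SECRET_API_KEY",
  model: "llama3.1:8b",
};
const cloudBackend: Backend = {
  name: "cloud",
  baseURL: "https://api.openai.com/v1",
  apiKey: "YOUR_OPENAI_API_KEY",
  model: "gpt-4o",
};

// Same payload shape for both backends; only URL, key, and model differ.
function buildChatRequest(b: Backend, prompt: string) {
  return {
    url: `${b.baseURL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${b.apiKey}`,
      },
      body: JSON.stringify({
        model: b.model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Returns one answer per backend name for side-by-side review.
async function compare(prompt: string): Promise<Record<string, string>> {
  const answers: Record<string, string> = {};
  for (const b of [localBackend, cloudBackend]) {
    const { url, init } = buildChatRequest(b, prompt);
    const res = await fetch(url, init);
    if (!res.ok) throw new Error(`${b.name}: HTTP ${res.status}`);
    const data = (await res.json()) as {
      choices: { message: { content: string } }[];
    };
    answers[b.name] = data.choices[0].message.content;
  }
  return answers;
}
```

Logging both answers alongside latency for a fixed prompt set gives a simple basis for deciding which workloads can move to the local model.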
When a model does not fit in VRAM, Ollama runs part or all of it on the CPU, which is 10-20x slower. This is not obvious from logs: the API still responds, just extremely slowly. If you are seeing unexpectedly slow inference (under 2 tokens/second for an 8B model), check GPU memory usage with nvidia-smi, or run ollama ps to see how each loaded model is split between CPU and GPU. The rule of thumb: Q4 quantized model size in GB plus 2GB overhead must fit in VRAM. For a 13B Q4 model (~8GB), you need at least 10GB VRAM. Size your hardware with headroom, because running at 95% VRAM utilization causes cache thrashing.
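The rule of thumb translates directly into a pre-deployment check. In this sketch the 2GB overhead and the 95% threshold come from the text; the function names are hypothetical.

```typescript
const OVERHEAD_GB = 2; // overhead from the rule of thumb
const MAX_UTILIZATION = 0.95; // utilization the text warns against

// VRAM needed for one model: quantized size plus fixed overhead.
function requiredVramGB(modelSizeGB: number): number {
  return modelSizeGB + OVERHEAD_GB;
}

// True when the model fits at all.
function fitsInVram(modelSizeGB: number, vramGB: number): boolean {
  return requiredVramGB(modelSizeGB) <= vramGB;
}

// True when the model fits and stays below the utilization threshold.
function hasHeadroom(modelSizeGB: number, vramGB: number): boolean {
  return requiredVramGB(modelSizeGB) / vramGB < MAX_UTILIZATION;
}

console.log(fitsInVram(8, 10)); // true: 13B Q4 (~8GB) on a 10GB card
console.log(hasHeadroom(8, 10)); // false: that card would be fully used
console.log(hasHeadroom(8, 12)); // true
```

When several models stay loaded at once, the same check applies to the sum of their sizes.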
Local LLMs win when data privacy requirements prevent sending data to third parties, when you have high-volume repetitive tasks (classification, extraction) where API costs add up, when you need sub-100ms responses (possible with 7B models on GPU), or when you need guaranteed availability without API rate limits. Cloud LLMs win when you need the best possible quality (GPT-4o, Claude Opus), when tasks require the largest context windows, when you need built-in multimodal capabilities, or when your team lacks the infrastructure expertise to maintain local deployments.
Current production setup at Commsult Indonesia:
- Hardware: Ollama running on a dedicated server with 64GB RAM and an NVIDIA RTX 4090 (24GB VRAM).
- Models loaded: Llama 3.1 8B Instruct (primary chat), Qwen2.5-Coder 7B (code tasks), Mistral NeMo 12B (document extraction).
- Authentication: nginx with API key validation.
- Monitoring: Prometheus with a custom Ollama exporter tracking request latency, token throughput, and model load times.
- Monthly compute cost: fixed server depreciation, with zero per-token costs. At our usage volume (~50K tokens/day), this is 80% cheaper than GPT-4o for equivalent quality tasks.
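To check a fixed-cost versus per-token comparison against your own numbers, the arithmetic can be sketched as below. Every input is a parameter you supply: no real prices are assumed, the function names are hypothetical, and a 30-day month is my simplification.

```typescript
// Monthly spend on a pay-per-token API.
// ASSUMPTION: one blended price per million tokens, 30-day month.
function monthlyApiCostUSD(
  tokensPerDay: number,
  usdPerMillionTokens: number,
  daysPerMonth = 30,
): number {
  return (tokensPerDay * daysPerMonth * usdPerMillionTokens) / 1e6;
}

// Daily token volume at which a fixed monthly server cost equals API spend.
function breakEvenTokensPerDay(
  serverMonthlyUSD: number,
  usdPerMillionTokens: number,
  daysPerMonth = 30,
): number {
  return (serverMonthlyUSD * 1e6) / (usdPerMillionTokens * daysPerMonth);
}
```

Above the break-even volume the fixed server cost is the cheaper option on compute alone; below it, the API is. The sketch leaves out operations time and electricity.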