Why use Ollama for local LLM inference instead of cloud APIs like OpenAI or Anthropic?

At Commsult Indonesia, the primary driver was data privacy: sending employee or client data to cloud providers raised compliance questions that local inference avoided entirely. Beyond privacy, local inference eliminates per-token costs. At about 50K tokens/day, we estimate the production setup is 80% cheaper than GPT-4o for tasks where output quality is comparable.

How do you handle authentication since Ollama has no built-in auth?

The default Ollama setup has no authentication, so the recommendation is to always put an authenticated nginx reverse proxy in front for production deployments. For team use, an OpenAI-compatible proxy like LiteLLM adds authentication, rate limiting, model routing, and cost tracking across both local and cloud models from a single endpoint.

How do you avoid cold-start latency when Ollama loads a model for the first time?

Set OLLAMA_KEEP_ALIVE=24h in the environment so Ollama keeps models in memory between requests. Additionally, preload the most frequently used models at server startup and implement a warm-up call on startup. The first request after loading a model takes 3–8 seconds; subsequent requests are nearly instant.

When do local LLMs lose to cloud models, and when do they win?

Local LLMs win for data-privacy-sensitive workloads, high-volume repetitive tasks like classification and extraction where API costs accumulate, sub-100ms latency requirements on GPU, and when guaranteed availability without rate limits matters. Cloud models win when you need the absolute highest quality (GPT-4o, Claude Opus), the largest context windows, built-in multimodal capabilities, or when your team lacks the infrastructure expertise to maintain local deployments.

Running Local LLMs with Ollama in Production: What I Learned

What hardware do you need to run Llama 3.1 8B or 70B locally in production?

The 8B model (Q4_K_M quantization) runs on 8GB VRAM — a single RTX 3080 or Apple M2 Pro is sufficient. The 70B model requires about 40GB of RAM and fits on a workstation with 64GB unified memory or two consumer GPUs. The key rule: Q4 model size in GB plus 2GB overhead must fit in VRAM, or inference silently falls back to CPU at 10–20x slower speeds.

Running LLMs locally used to require a machine learning PhD and a rack of GPUs. Ollama changed that. It is a single binary that manages model downloads, quantization, and a local OpenAI-compatible API endpoint. I started using Ollama for internal tooling at Commsult Indonesia where data privacy requirements make cloud LLM APIs complicated — sending employee or client data to OpenAI or Anthropic raises compliance questions our legal team needed time to evaluate. Local inference solved that immediately. This guide covers my production setup: model selection, hardware sizing, API integration, and the honest trade-offs versus cloud APIs.

Why Local LLMs Are Now Practical

The quantization revolution is why local LLMs are viable in 2025. A 70B parameter model like Llama 3.1 70B, quantized to 4-bit (Q4_K_M), requires about 40GB of RAM and can run on a workstation with 64GB unified memory or two consumer GPUs. The 8B models like Llama 3.1 8B run comfortably on 8GB VRAM, such as a single RTX 3080 or Apple M2 Pro. In my experience, quality for most business tasks (summarization, classification, structured extraction, Q&A over documents) now reaches roughly 85-90% of what GPT-4-class models produce. For tasks like code completion, formatting ERP data, and generating reports, the gap is even smaller.
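The memory figures above follow from simple arithmetic on parameter count and bits per weight. Here is a rough sketch, assuming Q4_K_M averages about 4.5 bits per weight. That figure is my assumption for illustration, and real model files vary. `estimateWeightsGb` is a hypothetical helper, not part of Ollama.

```typescript
// Rough size of quantized weights in decimal GB.
// Assumption: Q4_K_M averages about 4.5 bits per weight; actual files vary by model.
function estimateWeightsGb(paramsBillions: number, bitsPerWeight: number = 4.5): number {
  // billions of parameters * bits per weight / 8 bits per byte = billions of bytes (GB)
  return (paramsBillions * bitsPerWeight) / 8;
}

console.log(estimateWeightsGb(70).toFixed(1)); // prints "39.4" (close to the ~40GB quoted for 70B)
console.log(estimateWeightsGb(8).toFixed(1));  // prints "4.5"
```

The estimate covers weights only. Context (KV cache) and runtime buffers come on top, which is why the 8B model is quoted against 8GB of VRAM rather than 5GB.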

Model Selection Guide

For production local inference, I use:

- Llama 3.1 8B (Q4_K_M) for fast, latency-sensitive tasks like chat and quick lookups: 2-4 tokens/second on CPU, 20-40 tokens/second on GPU.
- Llama 3.1 70B (Q4_K_M) for complex reasoning, code generation, and document analysis. It runs on our dev server with 128GB RAM.
- Mistral NeMo 12B for structured extraction tasks. It is excellent at following JSON schemas.
- Qwen2.5-Coder 7B for code-specific tasks. It beats larger general models on coding benchmarks at a fraction of the compute cost.

Always use the -Instruct or -Chat variants, not the base models. The default Ollama tags used in this guide (llama3.1:8b, qwen2.5-coder:7b, mistral-nemo:12b) are already instruction-tuned.
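In application code, this selection can live in one lookup table so callers ask for a task rather than a model. This is a minimal sketch. The task names and `pickModel` are hypothetical, and `llama3.1:70b` is my assumption for the Ollama tag of the 70B model.

```typescript
// Task categories mirror the model selection described above (names are illustrative).
type Task = "chat" | "reasoning" | "extraction" | "code";

// Ollama model tags per task; adjust to the models you have actually pulled.
const MODEL_FOR_TASK: Record<Task, string> = {
  chat: "llama3.1:8b",
  reasoning: "llama3.1:70b", // assumed tag for the 70B model
  extraction: "mistral-nemo:12b",
  code: "qwen2.5-coder:7b",
};

// Returns the model tag to send in the "model" field of a request.
function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}

console.log(pickModel("extraction")); // prints "mistral-nemo:12b"
```

Keeping the mapping in one place makes it easy to swap a model after an evaluation without touching call sites.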

┌─────────────────────────────────────────────────────────────┐
│              Ollama Production Architecture                  │
│                                                             │
│  Application Layer                                          │
│  ┌──────────────────────────────────────┐                  │
│  │  NestJS / Node.js App               │                  │
│  │  (uses OpenAI SDK, base_url=local)  │                  │
│  └─────────────────┬────────────────────┘                  │
│                    │ HTTP                                   │
│                    ▼                                        │
│  ┌──────────────────────────────────────┐                  │
│  │  Nginx (TLS + API Key Auth)          │                  │
│  └─────────────────┬────────────────────┘                  │
│                    │ http://localhost:11434                  │
│                    ▼                                        │
│  ┌──────────────────────────────────────┐                  │
│  │  Ollama Service                      │                  │
│  │  - llama3.1:8b (chat)               │                  │
│  │  - qwen2.5-coder:7b (code)          │                  │
│  │  - mistral-nemo:12b (extraction)    │                  │
│  └──────────────────────────────────────┘                  │
│                    │                                        │
│                    ▼                                        │
│  GPU: NVIDIA RTX 4090 (24GB VRAM)                          │
└─────────────────────────────────────────────────────────────┘

From my experience running Ollama in production on a DigitalOcean GPU Droplet: preload your most frequently used models at startup to avoid cold-start latency. Ollama keeps models in memory if you set OLLAMA_KEEP_ALIVE=24h in the environment. The first request after loading a model takes 3-8 seconds for the initial load; subsequent requests are nearly instant. For latency-sensitive applications, implement a warm-up call on server startup.
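A warm-up call can be sketched as below. It relies on Ollama's native /api/generate endpoint, which loads a model into memory when the request carries no prompt. The helper names, the base URL default, and the model list are assumptions for illustration.

```typescript
// Assumptions: Ollama listens on localhost:11434; helper names are hypothetical.
interface WarmupRequest {
  url: string;
  body: { model: string; keep_alive: string };
}

// Builds one load-only request per model (no prompt means "just load the model").
function buildWarmupRequests(
  models: string[],
  baseUrl: string = "http://localhost:11434",
  keepAlive: string = "24h",
): WarmupRequest[] {
  return models.map((model) => ({
    url: `${baseUrl}/api/generate`,
    body: { model, keep_alive: keepAlive },
  }));
}

// Sends the requests one at a time and logs how long each model took to load.
async function warmUpModels(models: string[]): Promise<void> {
  for (const req of buildWarmupRequests(models)) {
    const started = Date.now();
    const res = await fetch(req.url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(req.body),
    });
    if (!res.ok) {
      throw new Error(`warm-up failed for ${req.body.model}: HTTP ${res.status}`);
    }
    await res.text(); // drain the response body
    console.log(`${req.body.model} loaded in ${Date.now() - started} ms`);
  }
}
```

Calling `warmUpModels(["llama3.1:8b", "qwen2.5-coder:7b"])` from the application's startup hook moves the 3-8 second load out of the first user request.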

Production Setup and Configuration

Running Ollama as a production service rather than a developer tool requires several additional steps: systemd service configuration, resource limits, authentication proxy, and monitoring. The default Ollama setup has no authentication — anyone on your network can call the API. For production, always put an authenticated nginx reverse proxy in front.

Nginx Auth Proxy for Ollama

The configuration I use puts Ollama behind nginx with HTTP basic auth for simple internal deployments, or an API key validation proxy for team deployments. For enterprise use, a more robust solution is an OpenAI-compatible proxy like LiteLLM that handles authentication, rate limiting, model routing, and cost tracking across both local and cloud models from a single endpoint.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Configure as systemd service with bind address, keep-alive, and concurrency limits
sudo tee /etc/systemd/system/ollama.service > /dev/null <<'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=4"

[Install]
WantedBy=default.target
EOF

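# Reload systemd and restart so the new unit file and environment take effect
# (the install script may already have started a default ollama service)
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama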

# Pull models
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:7b
ollama pull mistral-nemo:12b

# Nginx API key auth proxy
sudo tee /etc/nginx/sites-available/ollama > /dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;

    ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;

    location /v1/ {
        # Validate API key
        if ($http_authorization != "Bearer YOUR_SECRET_API_KEY") {
            return 401 '{"error": "Unauthorized"}';
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;
        # Pass streamed tokens to the client as they are generated
        proxy_buffering off;
    }
}
EOF

# Enable the site, validate the config, and reload nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/ollama
sudo nginx -t && sudo systemctl reload nginx

# Use with OpenAI SDK (drop-in replacement)
# import OpenAI from 'openai'
# const client = new OpenAI({
#   baseURL: 'https://llm.yourcompany.com/v1',
#   apiKey: 'YOUR_SECRET_API_KEY',
# })
# const response = await client.chat.completions.create({
#   model: 'llama3.1:8b',
#   messages: [{ role: 'user', content: 'Hello!' }],
# })
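
The nginx rule above checks a single shared key. For team deployments, an API key validation layer usually needs several keys, one per caller. Here is a minimal sketch of that check. The key values, team names, and function names are hypothetical, and real keys should come from a secret store rather than source code.

```typescript
// Hypothetical key registry: API key -> team name. Load from a secret store in practice.
const API_KEYS: Record<string, string> = {
  "key-for-erp-team": "erp",
  "key-for-hr-tools": "hr",
};

// Best-effort comparison that does not exit early on the first mismatching character.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt past the end yields NaN, which "|| 0" turns into 0
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Returns the team name for a valid "Bearer <key>" header, or null if unauthorized.
function authorize(authorizationHeader: string | undefined): string | null {
  if (!authorizationHeader || !authorizationHeader.startsWith("Bearer ")) {
    return null;
  }
  const presented = authorizationHeader.slice("Bearer ".length);
  let team: string | null = null;
  for (const [key, name] of Object.entries(API_KEYS)) {
    if (constantTimeEqual(presented, key)) team = name;
  }
  return team;
}

console.log(authorize("Bearer key-for-erp-team")); // prints "erp"
console.log(authorize("Bearer wrong-key"));        // prints null
```

Returning the team name rather than a boolean lets the proxy attach it to logs or rate limits. LiteLLM provides the same idea as a ready-made service.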

API Integration: Drop-in OpenAI Replacement

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1, so the official OpenAI SDKs work against it for the endpoints Ollama implements, such as chat completions and embeddings. Point the base URL (baseURL in the Node SDK, base_url in Python) at your Ollama endpoint, change the model name, and your existing code runs against local models. This is the fastest path to local inference for existing applications. I use this to A/B test local vs cloud responses for the same prompts, measure quality differences, and gradually migrate cost-sensitive workloads to local models.
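Because both sides speak the same chat completions format, an A/B comparison only needs two targets and one request builder. The sketch below uses plain `fetch` instead of the SDK to stay dependency-free. `ChatTarget`, `buildChatRequest`, and `timedChat` are hypothetical names, and the local API key is a placeholder, since Ollama itself does not check it.

```typescript
// One endpoint to compare: base URL, API key, and model name.
interface ChatTarget {
  baseUrl: string;
  apiKey: string;
  model: string;
}

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds an OpenAI-style chat completions request for any compatible endpoint.
function buildChatRequest(target: ChatTarget, prompt: string): ChatRequest {
  return {
    url: `${target.baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${target.apiKey}`,
      },
      body: JSON.stringify({
        model: target.model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Sends the prompt to one target and records wall-clock latency.
async function timedChat(target: ChatTarget, prompt: string) {
  const { url, init } = buildChatRequest(target, prompt);
  const started = Date.now();
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`${target.model}: HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return { model: target.model, text: data.choices[0].message.content, ms: Date.now() - started };
}

// Example targets (assumed values): a local Ollama model and a cloud model.
const local: ChatTarget = { baseUrl: "http://localhost:11434/v1", apiKey: "ollama", model: "llama3.1:8b" };
const cloud: ChatTarget = { baseUrl: "https://api.openai.com/v1", apiKey: "YOUR_OPENAI_KEY", model: "gpt-4o" };
```

Running `Promise.all([timedChat(local, p), timedChat(cloud, p)])` for each prompt `p` gives paired outputs and latencies to review side by side.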

When a model does not fit in VRAM, Ollama offloads some or all of its layers to the CPU, which is 10-20x slower. This is not obvious from logs: the API still responds, just extremely slowly. If you are seeing unexpectedly slow inference (under 2 tokens/second for an 8B model), check GPU memory usage with nvidia-smi, or run ollama ps to see how the loaded model is split between CPU and GPU. The rule of thumb: Q4 quantized model size in GB plus 2GB overhead must fit in VRAM. For a 13B Q4 model (~8GB), you need at least 10GB VRAM. Size your hardware with headroom, because the KV cache grows with context length and parallel requests, and a card running at 95% VRAM utilization has little room left for it.
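The rule of thumb is easy to encode as a pre-deployment check. This is a minimal sketch with a hypothetical `fitsInVram` helper. The 2GB overhead default comes from the rule above, and the model sizes are the approximate figures quoted in this guide.

```typescript
// Rule of thumb: quantized model size + overhead must fit in VRAM.
function fitsInVram(modelSizeGb: number, vramGb: number, overheadGb: number = 2): boolean {
  return modelSizeGb + overheadGb <= vramGb;
}

console.log(fitsInVram(8, 10));  // 13B Q4 (~8GB) on a 10GB card: prints true
console.log(fitsInVram(8, 8));   // same model on an 8GB card: prints false
console.log(fitsInVram(40, 24)); // 70B Q4 (~40GB) on a 24GB RTX 4090: prints false
```

A false result means part of the model would run on the CPU, so either pick a smaller model or a lower-bit quantization.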

When Local Beats Cloud

Local LLMs win when: data privacy requirements prevent sending data to third parties, you have high volume repetitive tasks (classification, extraction) where API costs add up, latency requirements need sub-100ms responses (possible with 7B models on GPU), or you need guaranteed availability without API rate limits. Cloud LLMs win when: you need the best possible quality (GPT-4o, Claude Opus), tasks require the largest context windows, you need built-in multimodal capabilities, or your team lacks the infrastructure expertise to maintain local deployments.

My Production Stack

Current production setup at Commsult Indonesia: Ollama running on a dedicated server with 64GB RAM and NVIDIA RTX 4090 (24GB VRAM). Models loaded: Llama 3.1 8B Instruct (primary chat), Qwen2.5-Coder 7B (code tasks), Mistral NeMo 12B (document extraction). Authentication via nginx with API key validation. Monitoring via Prometheus with a custom Ollama exporter tracking request latency, token throughput, and model load times. Monthly compute cost: fixed server depreciation, with zero per-token costs. At our usage volume (~50K tokens/day), we estimate this is 80% cheaper than GPT-4o for tasks where output quality is comparable.
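The three metrics the exporter tracks can be derived from timing fields that Ollama includes in completed responses from its native /api/generate and /api/chat endpoints, with durations reported in nanoseconds. The sketch below shows only that conversion. It is not the exporter itself, and `toMetrics` and the sample values are hypothetical.

```typescript
// Subset of the timing fields in a completed Ollama response (durations in nanoseconds).
interface OllamaTimings {
  total_duration: number; // whole request
  load_duration: number;  // time spent loading the model
  eval_count: number;     // number of generated tokens
  eval_duration: number;  // time spent generating tokens
}

interface RequestMetrics {
  latencySeconds: number;
  loadSeconds: number;
  tokensPerSecond: number;
}

// Converts raw timings into the values a Prometheus exporter would publish.
function toMetrics(t: OllamaTimings): RequestMetrics {
  const NS_PER_SECOND = 1e9;
  return {
    latencySeconds: t.total_duration / NS_PER_SECOND,
    loadSeconds: t.load_duration / NS_PER_SECOND,
    // Guard against division by zero when no tokens were generated.
    tokensPerSecond: t.eval_duration > 0 ? t.eval_count / (t.eval_duration / NS_PER_SECOND) : 0,
  };
}

// Made-up sample: 120 tokens in 4s of generation, 1s of model load, 5s total.
const sample = toMetrics({ total_duration: 5e9, load_duration: 1e9, eval_count: 120, eval_duration: 4e9 });
console.log(sample.tokensPerSecond); // prints 30
```

A sustained drop in tokens per second for a model that normally runs at GPU speed is the clearest signal that it has spilled onto the CPU.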
