Running Open-Source AI Models Locally: A Developer's Guide to Self-Hosted LLMs

Explore the benefits of running open-source AI models locally with this comprehensive guide. Learn how to set up self-hosted LLMs and optimize your costs.
Introduction
I ran the numbers on our OpenAI bill last quarter. A single product feature, a document summarizer handling about 50,000 requests per month, cost us $740. Not catastrophic, but it got me thinking: a used RTX 3090 on eBay goes for around $700. One GPU, one-time purchase, unlimited requests forever. What would it actually take to make that switch?
Turns out, quite a lot has changed since I last looked. Qwen 3.5 dropped in March 2026 with multimodal support, mixture-of-experts variants, 256K context windows, and coverage for over 200 languages. All of it downloadable, all of it free. The 9B model fits in 6.6GB of disk space and runs on a MacBook.
None of that means you should rip out your API calls tomorrow. Self-hosting brings operational overhead, and for some workloads the cloud is still the better deal. This guide lays out when local makes sense and when it doesn't, then walks through a concrete setup: Ollama, Docker, Qwen 3.5, wired into a Laravel + Next.js stack with a routing layer that sends cheap tasks to your GPU and expensive ones to the cloud.
API vs Self-Hosted: When Each Makes Sense
Framing this as "local vs cloud" creates a false dichotomy. Most teams will end up using both. The real question is which requests deserve which backend, and that comes down to four things: money, privacy, speed, and quality.
Cost Structure
Cloud APIs charge per token. OpenAI's GPT-4.1 sits at roughly $2 per million input tokens and $8 per million output. Sounds cheap until you multiply it out. That chatbot handling 50,000 conversations a month with an average of 2,000 tokens each? Somewhere between $200 and $800 monthly, depending on how verbose your model gets.
A self-hosted setup looks different. You pay once for hardware (an RTX 4090 runs about $1,600), then your ongoing cost is just electricity. Call it $15-30 per month depending on how hard you push the card. Break-even lands around the 3-6 month mark for teams with moderate traffic. If you're doing fewer than a couple thousand requests per day, APIs win when you account for engineering time. Go above that, and the math flips fast.
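If you want to sanity-check that break-even claim against your own numbers, the arithmetic is simple enough to sketch. The figures in the example below (hardware price, electricity cost, API bill) are illustrative assumptions, not measurements:

```typescript
// Rough break-even estimate: months until a one-time GPU purchase
// beats ongoing per-token API spend. All inputs are assumptions.
function breakEvenMonths(
  hardwareCost: number,       // one-time GPU price, e.g. $1,600
  monthlyElectricity: number, // e.g. $20
  monthlyApiBill: number,     // what the same traffic costs via the API
): number {
  const monthlySavings = monthlyApiBill - monthlyElectricity;
  // If the API is cheaper than your power bill, you never break even.
  if (monthlySavings <= 0) return Infinity;
  return hardwareCost / monthlySavings;
}

// Example: $1,600 GPU, $20/month power, $400/month API bill
console.log(breakEvenMonths(1600, 20, 400).toFixed(1)); // ≈ 4.2 months
```

Plug in your own traffic and the answer falls out quickly; the engineering time to run the box is the one cost this sketch deliberately leaves out.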
Data Governance and Privacy
If you work in healthcare, finance, or legal, you already know this pain. Sending patient records or financial data to a third-party API means data processing agreements, compliance audits, and lawyers who bill by the hour. Running a model on your own iron sidesteps all of that. Every token stays on your network. For GDPR or HIPAA workloads, that alone can justify the setup cost.
Latency
An API call to OpenAI has to survive DNS resolution, a TLS handshake, provider-side queuing, and the return trip. That's 200-500ms before a single token shows up. Local inference on a decent GPU? Under 50ms to first token. For autocomplete, inline code suggestions, or agent tool calls, you can feel that difference in your fingers.
Output Quality
Let's be honest about this one. GPT-4.1, Claude Opus, Gemini 2.5 Pro โ they're still better at complex multi-step reasoning, long-context synthesis, and tricky instruction following. But "better" has a shrinking definition. Qwen 3.5 27B scores within a few percentage points of GPT-4-class models on MMLU-Pro and GPQA. For classification, extraction, summarization, or bread-and-butter code generation, you'd struggle to tell the outputs apart in a blind test.
| Factor | Cloud API | Self-Hosted (Local) |
|---|---|---|
| Upfront cost | None | $800 - $5,000+ (GPU hardware) |
| Marginal cost per request | $0.002 - $0.06 per 1K tokens | Near zero (electricity only) |
| Data privacy | Data leaves your network | Fully on-premises |
| Latency (time to first token) | 200 - 500ms | 20 - 80ms |
| Peak quality (complex reasoning) | Higher (frontier models) | Competitive for most tasks |
| Operational overhead | Managed by provider | You own monitoring, scaling, updates |
| Scalability | Elastic, pay-as-you-go | Bounded by hardware |
| Vendor lock-in | Moderate to high | None |
So what's the play? Handle the 70-80% of requests that don't need frontier brains on your own hardware, and keep the API for the stuff that actually benefits from it. We'll build exactly that routing layer later in this article.
Setting Up Your First Local LLM with Ollama and Docker
Ollama wraps model downloading, quantization, and an OpenAI-compatible API server into a single binary. Think of it as the Docker of LLMs: pull a model, run it, hit an endpoint. Five minutes from install to first response.
Prerequisites
How much hardware you need depends on which Qwen 3.5 variant you pick:
- Qwen 3.5 4B: 8GB RAM minimum. Runs on integrated GPUs and Apple Silicon just fine. 3.4GB download.
- Qwen 3.5 9B (the default): 16GB RAM minimum. This is the sweet spot for most developers. 6.6GB download.
- Qwen 3.5 27B: 32GB RAM or a GPU with 24GB VRAM. Gets close to frontier quality on most benchmarks. 17GB download.
- Qwen 3.5 35B-A3B (MoE): 35B total parameters, but only 3B activate per forward pass. Needs 32GB RAM, but throughput is surprisingly good for a model this size. 24GB download.
NVIDIA cards with 8GB+ VRAM handle the 9B model well. Apple Silicon Macs with 16GB unified memory work too, since Ollama has native Metal support.
Install Ollama and Pull Qwen 3.5
```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen 3.5 (default 9B model, 6.6GB download)
ollama pull qwen3.5

# Or pull a specific size
ollama pull qwen3.5:27b

# Verify the model is available
ollama list
```

Docker Compose Setup
For anything beyond local tinkering, running Ollama in Docker keeps your host clean and makes the whole setup reproducible.
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data:
```

```bash
# Start the container
docker compose up -d

# Pull the model inside the container
docker compose exec ollama ollama pull qwen3.5
```

Test with a cURL Request
Ollama exposes an OpenAI-compatible API on port 11434. This is the detail that makes everything else in this article work: any SDK, framework, or tool built for OpenAI's API will talk to your local model if you swap out the base URL.
```bash
# Chat completion (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [
      {"role": "user", "content": "Explain dependency injection in three sentences."}
    ],
    "temperature": 0.7
  }'
```

Quick Benchmarking
Before you commit to a model size, benchmark it on your actual hardware. The two numbers that matter are tokens per second and time to first token.
```bash
# Ollama shows performance stats with verbose mode
ollama run qwen3.5 --verbose "Write a function that reverses a linked list in Python"

# Look for these lines in the output:
# eval rate: ~45 tokens/s (varies by hardware)
# prompt eval duration: ~120ms
```

On an M2 MacBook Pro with 16GB, I see roughly 40-50 tokens per second from the 9B model. An RTX 4090 pushes that to 80-120 tok/s. Apple's M3 Pro and M4 chips with 36GB unified memory handle the 27B model comfortably at 25-35 tok/s, which is still perfectly usable for development and even light production. Anything above 30 tokens per second feels responsive in a streaming chat UI. Drop below 15, and users start noticing.
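If you'd rather benchmark programmatically, Ollama's native API responses include `eval_count` (tokens generated) and `eval_duration` (nanoseconds) fields you can turn into a tokens-per-second number. A small helper, with thresholds matching the rule of thumb above:

```typescript
// Convert Ollama's response stats into tokens/sec.
// eval_count = generated tokens; eval_duration is in nanoseconds.
function tokensPerSecond(evalCount: number, evalDurationNs: number): number {
  return evalCount / (evalDurationNs / 1e9);
}

// Rule of thumb from the text: >30 tok/s feels responsive, <15 feels slow.
function feels(tps: number): string {
  if (tps > 30) return "responsive";
  if (tps < 15) return "sluggish";
  return "acceptable";
}

// Example: 450 tokens generated over 10 seconds (10e9 ns)
const tps = tokensPerSecond(450, 10_000_000_000);
console.log(tps, feels(tps)); // 45 responsive
```

Run a handful of prompts that resemble your real workload rather than a single toy request; prompt length and output length both move the numbers.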
Integrating Local Models into Your Laravel + Next.js Stack
Here's where the OpenAI compatibility pays off. The OpenAI PHP SDK accepts a custom base URL. Point it at localhost:11434 instead of api.openai.com, and every existing call in your codebase just... works. No new dependencies, no wrapper libraries, no SDK migration.
Laravel Backend with the OpenAI PHP SDK
```php
// config/services.php
'llm' => [
    'local' => [
        'url' => env('OLLAMA_URL', 'http://localhost:11434/v1'),
        'model' => env('OLLAMA_MODEL', 'qwen3.5'),
    ],
    'cloud' => [
        'key' => env('OPENAI_API_KEY'),
        'model' => env('OPENAI_MODEL', 'gpt-4.1-mini'),
    ],
],
```

```php
// app/Services/LlmService.php
namespace App\Services;

use OpenAI;

class LlmService
{
    private $localClient;
    private $cloudClient;

    public function __construct()
    {
        $this->localClient = OpenAI::factory()
            ->withBaseUri(config('services.llm.local.url'))
            ->withApiKey('ollama') // Ollama ignores this, but the SDK requires it
            ->make();

        $this->cloudClient = OpenAI::client(config('services.llm.cloud.key'));
    }

    public function complete(string $prompt, string $backend = 'local'): string
    {
        $client = $backend === 'local' ? $this->localClient : $this->cloudClient;

        $model = $backend === 'local'
            ? config('services.llm.local.model')
            : config('services.llm.cloud.model');

        $response = $client->chat()->create([
            'model' => $model,
            'messages' => [
                ['role' => 'user', 'content' => $prompt],
            ],
        ]);

        return $response->choices[0]->message->content;
    }
}
```

Streaming Responses to Next.js
Nobody wants to stare at a blank screen while the model thinks. For chat interfaces, stream tokens to the frontend using the Vercel AI SDK.
```typescript
// app/api/chat/route.ts (Next.js App Router)
import { streamText } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';

const localProvider = createOpenAI({
  baseURL: process.env.OLLAMA_URL || 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: localProvider('qwen3.5'),
    messages,
  });

  return result.toDataStreamResponse();
}
```

The beauty of this adapter pattern is that swapping backends becomes a config change, not a code change. Dev environment runs against Ollama locally. Staging points at OpenAI. Production uses both, which is exactly what we'll set up next.
The Hybrid Approach: Smart Routing Between Local and Cloud
Why pick one backend when you can use both? The idea is simple: cheap, high-volume tasks run on your local GPU. Tasks that genuinely need frontier-level reasoning go to the cloud API. A small routing layer in your backend handles the decision, and the rest of your code never needs to know which model answered.
Building a Model Router
```php
// app/Services/ModelRouter.php
namespace App\Services;

class ModelRouter
{
    private const LOCAL_TASKS = [
        'classify', 'extract', 'summarize',
        'translate', 'format', 'parse',
    ];

    public function resolve(string $task, int $inputTokens): string
    {
        // Long-context or complex reasoning -> cloud
        if ($inputTokens > 8000) {
            return 'cloud';
        }

        // Known simple tasks -> local
        if (in_array($task, self::LOCAL_TASKS, true)) {
            return 'local';
        }

        // Creative, multi-step, or ambiguous -> cloud
        return 'cloud';
    }
}
```

Wire this into the LlmService, and classification, extraction, translation, and formatting all run on your hardware at zero marginal cost. Creative generation, complex reasoning, and long documents still go to the cloud.
One thing I'd strongly recommend: log every routing decision. Task type, token count, backend chosen, latency, and whatever quality signal you can get. After a few weeks of data, you'll spot patterns that let you confidently shift more work to local. We started with 60% local routing and pushed it to 78% within a month once we saw the quality numbers.
Most SaaS workloads are dominated by the "boring" stuff. Input validation, entity extraction, content tagging, auto-replies to FAQ-type questions. A typical app might route 75% of LLM calls locally, sending only the remaining quarter to the cloud for marketing copy, long document analysis, or tricky customer escalations. Your API bill drops accordingly.
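To put a rough number on "drops accordingly," here's a back-of-the-envelope estimate of the hybrid bill. The 75% local share and the blended per-million-token price are illustrative assumptions, not universal constants:

```typescript
// Estimate monthly API spend under hybrid routing, assuming locally
// routed requests cost ~nothing beyond electricity. Inputs are assumptions.
function hybridMonthlyBill(
  totalTokensPerMonth: number,  // e.g. 100M tokens across all LLM calls
  localShare: number,           // fraction routed locally, e.g. 0.75
  costPerMillionTokens: number, // blended API price, e.g. $5 per million
): number {
  const cloudTokens = totalTokensPerMonth * (1 - localShare);
  return (cloudTokens / 1_000_000) * costPerMillionTokens;
}

// 100M tokens/month at $5/M: $500 all-cloud, $125 with 75% routed locally
console.log(hybridMonthlyBill(100_000_000, 0.75, 5)); // 125
console.log(hybridMonthlyBill(100_000_000, 0, 5));    // 500
```

The interesting lever is `localShare`: every percentage point you shift local comes straight off the bill, which is why the routing logs from the previous paragraph are worth keeping.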
For teams that want to push the local percentage even higher, Unsloth makes fine-tuning surprisingly accessible. Fine-tune Qwen 3.5 9B on your domain-specific data, and you can close the quality gap for your particular use case without needing a cluster of A100s. That's an advanced move, but worth exploring once the basic routing is working.
Where to Go from Here
A year ago, self-hosting an LLM meant wrestling with CUDA drivers, writing custom inference code, and accepting noticeably worse output. That era is over. Qwen 3.5 competes with models that cost real money per token, Ollama makes serving it trivial, and the OpenAI-compatible API means your existing code barely needs to change.
Once you outgrow Ollama's single-machine setup, look at vLLM for high-throughput serving with continuous batching, or Hugging Face's Text Generation Inference if you're deploying on Kubernetes. Different tools for different scales, same core idea: you own the inference.
Deciding when to self-host, when to call an API, and how to build the routing layer between them is a product decision as much as a technical one. The AI Product Manager course at SkillHub covers the strategic side of that equation, including cost modeling and vendor evaluation frameworks. And if you're scaling the infrastructure behind it, the Highload Software Architecture course goes deep on the distributed systems patterns that keep self-hosted AI reliable under production traffic.
Start with Ollama and the 9B model. Measure your throughput. Route the high-volume stuff locally. Keep the API for what genuinely needs it. You might be surprised how little that turns out to be.
