Run AI models locally — Ollama, LM Studio, vLLM
Running AI models locally means zero API costs, complete data privacy, and no internet dependency. Ollama is the easiest tool — download and run models like Llama 3.3, Mistral, Phi-4, and Gemma 3 with a single command. LM Studio provides a polished GUI. vLLM offers high-throughput production serving on GPU clusters.
Install and run Llama 3.3, Mistral, Phi-4, Gemma 3, and 100+ models with a single terminal command
All inference runs on your hardware — no data ever sent to external APIs, ideal for sensitive workloads
One-time hardware investment; run unlimited queries with no per-token billing
User-friendly desktop app for discovering, downloading, and chatting with local models on Mac, Windows, Linux
Open weights
High-impact models you can run locally — phone-class to cluster-scale. Headline number is the largest parameter count; expand Details for runtimes, hardware notes, and watch-outs.
Showing 10 / 10 models
1.6T
Reasoning
DeepSeek · 2026-04
Runtimes
NVIDIA NIM · vLLM · SGLang · TensorRT-LLM
Hardware
Optimised for NVIDIA Blackwell — Pro hits 150+ tok/s/user on GB200 NVL72; Flash workable on 8× H100
Highlights
Watch-outs
671B
Reasoning
DeepSeek · 2026-01
Runtimes
Ollama · vLLM · llama.cpp · LM Studio
Hardware
7B fits 16GB VRAM Q4; 70B distilled needs 2× A100 / single H100
Highlights
Watch-outs
671B
Text generation
DeepSeek · 2025-12
Runtimes
vLLM · SGLang · TensorRT-LLM
Hardware
8× H100 / 8× MI300X recommended for full precision
Highlights
Watch-outs
72B
Text generation
Alibaba · 2026-02
Runtimes
Ollama · vLLM · llama.cpp · MLX (Apple)
Hardware
0.5B/1.8B run on phones; 14B fits 24GB VRAM; 72B needs A100 80GB
Highlights
Watch-outs
2T
Multimodal
Meta · 2026-04
Runtimes
vLLM · TensorRT-LLM · Ollama (Scout)
Hardware
Scout fits single H100; Maverick needs 8× H100
Highlights
Watch-outs
130B
Text generation
Zhipu / Tsinghua · 2026-03
Runtimes
vLLM · llama.cpp · Ollama
Hardware
32B at Q4 fits 24GB VRAM; 130B needs 2× A100 80GB
Highlights
Watch-outs
8B
Text generation
Mistral AI · 2025-10
Runtimes
Ollama · llama.cpp · MLC LLM (mobile)
Hardware
3B runs on phones / Raspberry Pi 5; 8B fits 16GB VRAM Q4
Highlights
Watch-outs
—
Image generation
Black Forest Labs · 2024-08, Kontext 2026-02
Runtimes
ComfyUI · Diffusers · SwarmUI
Hardware
12GB VRAM for [schnell] at 1024px; 24GB for [dev] full quality
Highlights
Watch-outs
1T
Text generation
Moonshot AI · 2026-03
Runtimes
vLLM · SGLang
Hardware
Trillion-param scale — needs 8× H200 / B200 cluster
Highlights
Watch-outs
27B
Multimodal
Google DeepMind · 2026-04
Runtimes
Ollama · llama.cpp · MLC LLM · MediaPipe (Android) · Core ML (iOS)
Hardware
270M runs on smartwatches; 9B fits 12GB VRAM Q4; 27B needs 24GB+
Highlights
Watch-outs
New API authentication middleware supporting API keys and JWTs has been released for enhanced security.
$0
Requires compatible hardware
$0
Free for personal use
Ollama and LM Studio expose OpenAI-compatible REST endpoints — drop-in replacement for existing code
High-throughput inference server with continuous batching, 4-bit quantization, and multi-GPU support
Local AI Models now offers advanced 4-bit and 2-bit quantization to reduce model memory requirements.
Hardware cost only
Open-source Apache 2.0