Local AI Models

Run AI models locally — Ollama, LM Studio, vLLM

Running AI models locally means zero API costs, complete data privacy, and no internet dependency. Ollama is the easiest tool — download and run models like Llama 3.3, Mistral, Phi-4, and Gemma 3 with a single command. LM Studio provides a polished GUI. vLLM offers high-throughput production serving on GPU clusters.

Website →Docs →GitHub →

Key Features

Ollama — One-Command Install

Install and run Llama 3.3, Mistral, Phi-4, Gemma 3, and 100+ models with a single terminal command

Complete Data Privacy

All inference runs on your hardware — no data ever sent to external APIs, ideal for sensitive workloads

Zero Ongoing API Costs

One-time hardware investment; run unlimited queries with no per-token billing

LM Studio GUI

User-friendly desktop app for discovering, downloading, and chatting with local models on Mac, Windows, Linux

Open weights

Featured open-source models

High-impact models you can run locally — phone-class to cluster-scale. Headline number is the largest parameter count; expand Details for runtimes, hardware notes, and watch-outs.

Showing 10 / 10 models

1.6T

Reasoning

MIT

DeepSeek V4

DeepSeek · 2026-04

Variants: V4-Pro (1.6T MoE / 49B active), V4-Flash (284B MoE / 13B active)
Context: 1M (384K max output)
Best for: Server

Details

Runtimes

NVIDIA NIM · vLLM · SGLang · TensorRT-LLM

Hardware

Optimised for NVIDIA Blackwell — Pro hits 150+ tok/s/user on GB200 NVL72; Flash workable on 8× H100

Highlights

Hybrid attention (CSA + DSA + HCA): 73% fewer FLOPs and 90% less KV-cache vs V3.2
Frontier-class reasoning, coding, and tool calling at MIT-licensed open weights
Flash variant viable for fast routing/summarization at lower infra cost

Watch-outs

Self-hosting Pro requires Blackwell-class infra
Default prompt still censors PRC-sensitive topics

Project home

671B

Reasoning

MIT

DeepSeek R1

DeepSeek · 2026-01

Variants: 1.5B / 7B / 14B / 32B / 70B distilled, 671B MoE
Context: 128K
Best for: WorkstationServer

Details

Runtimes

Ollama · vLLM · llama.cpp · LM Studio

Hardware

7B fits 16GB VRAM Q4; 70B distilled needs 2× A100 / single H100

Highlights

Open reasoning chain-of-thought, near-frontier on math/code
Distilled checkpoints competitive with much larger closed models
Aggressive pricing on hosted API ($0.55 / $2.19 per 1M)

Watch-outs

Censored on PRC-sensitive topics in default prompt
Compliance teams should review data-residency before hosted use

Project home

671B

Text generation

DeepSeek License

DeepSeek V3

DeepSeek · 2025-12

Variants: 671B MoE (37B active)
Context: 128K
Best for: Server

Details

Runtimes

vLLM · SGLang · TensorRT-LLM

Hardware

8× H100 / 8× MI300X recommended for full precision

Highlights

MoE architecture — strong throughput per dollar
Strong multilingual + coding performance

Watch-outs

Heavy infra needed for self-hosting

Project home

72B

Text generation

Apache 2.0

Qwen 3

Alibaba · 2026-02

Variants: 0.5B / 1.8B / 7B / 14B / 32B / 72B + Coder + Math
Context: 128K (extendable to 1M with YaRN)
Best for: MobileWorkstationServer

Details

Runtimes

Ollama · vLLM · llama.cpp · MLX (Apple)

Hardware

0.5B/1.8B run on phones; 14B fits 24GB VRAM; 72B needs A100 80GB

Highlights

Best-in-class multilingual coverage (29+ languages)
Coder + Math fine-tunes among top open models
Truly open license suitable for commercial use

Watch-outs

Tokenizer can be inefficient on Latin scripts

Project home

Multimodal

Llama 4 Community License

Llama 4

Meta · 2026-04

Variants: Scout 17B (active), Maverick 400B MoE, Behemoth 2T (preview)
Context: 10M (Scout), 1M (Maverick)
Best for: WorkstationServer

Details

Runtimes

vLLM · TensorRT-LLM · Ollama (Scout)

Hardware

Scout fits single H100; Maverick needs 8× H100

Highlights

Native multimodal (text + image + video frames)
Industry-leading 10M context on Scout
Strong tool use and structured output

Watch-outs

700M MAU clause in license — check before mass deployment
Behemoth still in preview

Project home

130B

Text generation

Apache 2.0

GLM 4.6

Zhipu / Tsinghua · 2026-03

Variants: 9B, 32B, 130B
Context: 200K
Best for: WorkstationServer

Details

Runtimes

vLLM · llama.cpp · Ollama

Hardware

32B at Q4 fits 24GB VRAM; 130B needs 2× A100 80GB

Highlights

Strong Chinese ↔ English bilingual quality
Good agent / tool-calling fine-tunes

Watch-outs

Less ecosystem tooling than Llama / Qwen

Project home

Text generation

Mistral Research License

Ministral

Mistral AI · 2025-10

Variants: 3B, 8B (Ministral 3B / 8B Instruct)
Context: 128K
Best for: MobileWorkstation

Details

Runtimes

Ollama · llama.cpp · MLC LLM (mobile)

Hardware

3B runs on phones / Raspberry Pi 5; 8B fits 16GB VRAM Q4

Highlights

Best-in-class on-device performance for sub-10B parameter range
Strong function-calling support

Watch-outs

Production commercial use of 3B requires paid license

Project home

—

Image generation

FLUX.1 Non-Commercial

FLUX.1

Black Forest Labs · 2024-08, Kontext 2026-02

Variants: [dev], [schnell], [pro] (hosted), Kontext (edit)
Context: n/a (image)
Best for: WorkstationServer

Details

Runtimes

ComfyUI · Diffusers · SwarmUI

Hardware

12GB VRAM for [schnell] at 1024px; 24GB for [dev] full quality

Highlights

Best open-weight text-to-image model — rivals Midjourney v6
Strong typography rendering and prompt adherence
Kontext variant adds image-conditioned editing

Watch-outs

[dev] license blocks commercial product use — use [schnell] or pro API

Project home

Text generation

Modified MIT

Kimi K2

Moonshot AI · 2026-03

Variants: 1T MoE (32B active)
Context: 2M
Best for: Server

Details

Runtimes

vLLM · SGLang

Hardware

Trillion-param scale — needs 8× H200 / B200 cluster

Highlights

Frontier-class agentic capabilities
2M context for codebase-scale comprehension
Open weights at a scale previously closed-only

Watch-outs

Self-hosting impractical for most teams — use hosted API
Massive infra cost

Project home

27B

Multimodal

Gemma Terms of Use

Gemma 4

Google DeepMind · 2026-04

Variants: 270M (mobile), 2B, 9B, 27B + CodeGemma + RecurrentGemma
Context: 128K (27B), 32K (smaller variants)
Best for: MobileWorkstationServer

Details

Runtimes

Ollama · llama.cpp · MLC LLM · MediaPipe (Android) · Core ML (iOS)

Hardware

270M runs on smartwatches; 9B fits 12GB VRAM Q4; 27B needs 24GB+

Highlights

First-class on-device support — Android NNAPI / iOS Core ML
Multimodal vision input on 9B/27B
Strong safety + responsible AI baseline

Watch-outs

Use-policy restrictions on certain content domains
Smaller community than Llama

Project home

Latest Updates

All updates

feature

Ollama v0.32.4 Introduces MLX Apple GPU Support for Laguna

Ollama v0.32.4 adds MLX Apple GPU support for Laguna and optimizes Qwen3 MoE decoding.

27 July 2026Source

feature

Guides

All guides

getting startedbeginnerFeatured

Run Local AI Models with Ollama

Install Ollama and run Llama 3.3, Phi-4, and other open models locally with zero API costs.

8 min readRead guide →

Pricing

Ollama (Free)

Requires compatible hardware

All open models
Mac/Windows/Linux
OpenAI-compatible API
Unlimited local queries

LM Studio (Free)

Free for personal use

GUI model browser
Multi-model chat

Local AI Models

Key Features

Ollama — One-Command Install

Complete Data Privacy

Zero Ongoing API Costs

LM Studio GUI

Featured open-source models

DeepSeek V4

DeepSeek R1

DeepSeek V3

Qwen 3

Llama 4

GLM 4.6

Ministral

FLUX.1

Kimi K2

Gemma 4

Latest Updates

Ollama v0.32.4 Introduces MLX Apple GPU Support for Laguna

Guides

Run Local AI Models with Ollama

Pricing

Ollama (Free)

LM Studio (Free)

OpenAI-Compatible API

vLLM Production Serving

LM Studio 0.4.20 Adds Enterprise Endpoints and Bionic Model Sharing

LocalAI v4.7.1 Release Brings Voice Cloning and Avatar Generation

vLLM (Self-hosted)