AZMX AI

Howto · 2026-05-21 · 10 min read

How to run LLMs locally in 2026.

A friction-honest walkthrough for running coding-grade models offline on a laptop you already own — Ollama, LM Studio, and the agent that talks to them.

In 2024, "run LLMs locally" meant a hobby project. In 2026, it's a serious workflow for a lot of working engineers — especially the ones who can't paste production code into a vendor chatbot. This is the walkthrough we'd hand a colleague on day one.

Why bother

  • Nothing leaves the machine. No vendor in the loop, no token bill, no retention policy to read.
  • Latency is local. First token in 100 ms, no network jitter.
  • Air-gapped works. Same setup on a plane, on a regulated network, in a SCIF.
  • You learn what the model can do. No system prompt you can't read, no router you can't see.

The hardware floor

The honest version: in 2026 you need at least 16 GB of unified memory or VRAM to do useful coding work locally. 24 GB is comfortable. 32 GB or more lets you run the larger Qwen and DeepSeek coder models at sensible quantizations. Apple Silicon (M2 or later) and any modern NVIDIA card both work. Intel/AMD CPU-only is possible but slow.

The two runners — Ollama and LM Studio

Ollama

CLI-first. Trivial to script. ollama pull qwen2.5-coder:14b and you're done. Listens on http://localhost:11434 with an OpenAI-compatible API. The default choice for engineers who don't want a GUI.

brew install ollama
ollama serve            # in one terminal
ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b

LM Studio

GUI-first. A model browser, a chat playground, a one-click local server. The right pick if you want to feel out a model before committing it to a workflow. Same OpenAI-compatible API surface on a port you choose.

Both ship server modes that any modern agent — including AZMX AI — can point at. You're not locked in.

Which model to start with

  • Qwen2.5-Coder 14B — the workhorse for general coding tasks on a 16 GB machine. Punches well above its weight.
  • Qwen2.5-Coder 32B — if you have 32 GB or more. Genuinely competitive with frontier models on many tasks.
  • DeepSeek-Coder-V2 — strong on architecture-level reasoning; heavier.
  • Llama 3.3 70B — best generalist if you have the RAM and patience.

Pick one. Use it for a week. Then try another. The "best" model on benchmarks isn't always the one your codebase agrees with.

Point AZMX AI at your local model

In Settings → AI, add a provider with the OpenAI-compatible adapter and the base URL of your runner:

Provider: openai-compatible
Base URL: http://localhost:11434/v1   (Ollama)
          http://localhost:1234/v1    (LM Studio default)
API key:  any string

From there, your agent runs against a model that lives on your machine. The network call is to localhost. No telemetry. No cloud round-trip. The first useful exchange never leaves the laptop.

What local models still can't do as well

Be honest: as of 2026, frontier models still have an edge on very long reasoning chains and on the trickiest novel problems. The gap is shrinking faster than most people expect, but it's there. The 2026 sweet spot is hybrid: local for the 80% of tasks that don't need frontier capability, BYOK frontier for the rest — both routed through one agent under one approval gate. AZMX AI is built for exactly that flow.

Get started with AZMX AI · Why BYOK pairs naturally with local

Bring any model. Or run no model at all online.

Point AZMX AI at localhost and the first useful exchange never leaves your machine.