Guide · 2026-05-25 · 8 min read
Local Inference with llama.cpp
A no-nonsense guide to compiling, quantifying, and running large language models on your own metal.
Running models locally is no longer a niche hobby. With the maturation of GGUF and the efficiency of llama.cpp, you can run 7B to 70B parameter models on standard laptops. This guide avoids the fluff and focuses on the binary execution and memory management required to get a model responding on your terminal without sending a single packet to a third-party API.
The Core Premise of llama.cpp
The primary goal of llama.cpp is to provide a high-performance LLM inference engine written in C/C++ with minimal dependencies. It utilizes 4-bit and 8-bit quantization to shrink model weights, allowing them to fit into VRAM or system RAM. Unlike Python-heavy stacks, llama.cpp is designed for raw speed and portability across macOS, Windows, and Linux.
Prerequisites
- A C++ compiler (GCC, Clang, or MSVC).
- CMake 3.13 or newer.
- A model file in GGUF format (the current standard for llama.cpp).
Step 1: Build from Source
While pre-built binaries exist, compiling from source ensures you utilize the specific instruction sets of your CPU (e.g., AVX2, AVX-512) or GPU (CUDA, Metal). Run the following in your terminal:
git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build cmake --build build --config Release
For macOS users, Metal support is enabled by default, allowing the model to offload layers to the Apple Silicon GPU. For NVIDIA users, add -DGGML_CUDA=ON to the cmake command to enable CUDA acceleration.
Step 2: Acquiring GGUF Models
You cannot use raw PyTorch .bin or Safetensors files directly in llama.cpp. You need the GGUF format. You can either convert them using the provided convert_hf_to_gguf.py script or download pre-quantized versions from Hugging Face. Look for the Q4_K_M quantization; it generally provides the best balance between perplexity (intelligence) and memory usage.
Step 3: Running the Inference
Use the main binary to start a chat session. The -m flag specifies the model path, and -n sets the number of tokens to predict.
./build/bin/llama-cli -m models/llama-3-8b-Q4_K_M.gguf -p "Explain quantum entanglement in one sentence." -n 128
To run a persistent interactive session, use the -i flag. If you have a GPU, use -ngl (number of GPU layers) to offload parts of the model to VRAM. For a 7B model, -ngl 32 usually offloads the entire model.
Memory Management and Quantization
Understanding the trade-off between quantization levels is critical. A 7B parameter model in FP16 requires ~14 GB of RAM. A 4-bit quantization (Q4) reduces this to ~4-5 GB. If you encounter out of memory errors, move to a lower quantization like Q2_K, though expect a noticeable drop in reasoning capabilities.
Integrating with Agentic Workflows
Running a raw CLI is useful for testing, but for production-grade development, you need an interface that handles project context and tool use. Most developers use a local server mode:
./build/bin/llama-server -m models/llama-3-8b-Q4_K_M.gguf --port 8080
This exposes an OpenAI-compatible API. You can then connect this endpoint to various AI editors. While tools like Cursor or GitHub Copilot rely heavily on cloud-based models, sovereign-first developers often prefer local endpoints to prevent IP leakage.
For those who need a full IDE experience with a local backend, AZMX AI integrates directly with local providers via LM Studio or Ollama (which are wrappers around llama.cpp). By pointing AZMX to your local server, you get a PTY terminal and a CodeMirror 6 editor without your code ever leaving your machine. This is a more streamlined alternative to configuring Aider or Cline manually with a local LLM backend.
Comparison with Other Local Runtimes
How does llama.cpp stack up against other options?
- Ollama: A wrapper around llama.cpp. It simplifies model management (
ollama run llama3) but hides the granular control of-ngland sampling parameters. - vLLM: Optimized for high-throughput serving on Linux/NVIDIA. Much faster for multiple concurrent users but requires significantly more VRAM and lacks the CPU-only flexibility of llama.cpp.
- ExLlamaV2: Specifically optimized for GPUs. Faster than llama.cpp on NVIDIA hardware but cannot run on CPUs.
Security Considerations
Local LLMs are inherently more secure than cloud APIs, but they are not risk-free. Be cautious of the models you download from public repositories. While GGUF is a data format and not an executable, always verify the hash of the model file. When using local models with agents, ensure your agent has a strict deny-list for sensitive directories like .ssh or .env to prevent the model from accidentally reading credentials into its context window. You can read more about this approach in the AZMX security documentation.
Summary Checklist
- Clone llama.cpp and build with the appropriate hardware flags (Metal/CUDA).
- Download a Q4_K_M GGUF model from Hugging Face.
- Execute via
llama-clifor quick tests orllama-serverfor API integration. - Offload as many layers as possible to the GPU using
-ngl. - Connect the API endpoint to a sovereign agent platform for actual coding work.