AZMX AI

Guide · 2026-05-25 · 8 min read

Local Inference with llama.cpp

A no-nonsense guide to compiling, quantifying, and running large language models on your own metal.

Running models locally is no longer a niche hobby. With the maturation of GGUF and the efficiency of llama.cpp, you can run 7B to 70B parameter models on standard laptops. This guide avoids the fluff and focuses on the binary execution and memory management required to get a model responding on your terminal without sending a single packet to a third-party API.

The Core Premise of llama.cpp

The primary goal of llama.cpp is to provide a high-performance LLM inference engine written in C/C++ with minimal dependencies. It utilizes 4-bit and 8-bit quantization to shrink model weights, allowing them to fit into VRAM or system RAM. Unlike Python-heavy stacks, llama.cpp is designed for raw speed and portability across macOS, Windows, and Linux.

Prerequisites

  • A C++ compiler (GCC, Clang, or MSVC).
  • CMake 3.13 or newer.
  • A model file in GGUF format (the current standard for llama.cpp).

Step 1: Build from Source

While pre-built binaries exist, compiling from source ensures you utilize the specific instruction sets of your CPU (e.g., AVX2, AVX-512) or GPU (CUDA, Metal). Run the following in your terminal:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

For macOS users, Metal support is enabled by default, allowing the model to offload layers to the Apple Silicon GPU. For NVIDIA users, add -DGGML_CUDA=ON to the cmake command to enable CUDA acceleration.

Step 2: Acquiring GGUF Models

You cannot use raw PyTorch .bin or Safetensors files directly in llama.cpp. You need the GGUF format. You can either convert them using the provided convert_hf_to_gguf.py script or download pre-quantized versions from Hugging Face. Look for the Q4_K_M quantization; it generally provides the best balance between perplexity (intelligence) and memory usage.

Step 3: Running the Inference

Use the main binary to start a chat session. The -m flag specifies the model path, and -n sets the number of tokens to predict.

./build/bin/llama-cli -m models/llama-3-8b-Q4_K_M.gguf -p "Explain quantum entanglement in one sentence." -n 128

To run a persistent interactive session, use the -i flag. If you have a GPU, use -ngl (number of GPU layers) to offload parts of the model to VRAM. For a 7B model, -ngl 32 usually offloads the entire model.

Memory Management and Quantization

Understanding the trade-off between quantization levels is critical. A 7B parameter model in FP16 requires ~14 GB of RAM. A 4-bit quantization (Q4) reduces this to ~4-5 GB. If you encounter out of memory errors, move to a lower quantization like Q2_K, though expect a noticeable drop in reasoning capabilities.

Integrating with Agentic Workflows

Running a raw CLI is useful for testing, but for production-grade development, you need an interface that handles project context and tool use. Most developers use a local server mode:

./build/bin/llama-server -m models/llama-3-8b-Q4_K_M.gguf --port 8080

This exposes an OpenAI-compatible API. You can then connect this endpoint to various AI editors. While tools like Cursor or GitHub Copilot rely heavily on cloud-based models, sovereign-first developers often prefer local endpoints to prevent IP leakage.

For those who need a full IDE experience with a local backend, AZMX AI integrates directly with local providers via LM Studio or Ollama (which are wrappers around llama.cpp). By pointing AZMX to your local server, you get a PTY terminal and a CodeMirror 6 editor without your code ever leaving your machine. This is a more streamlined alternative to configuring Aider or Cline manually with a local LLM backend.

Comparison with Other Local Runtimes

How does llama.cpp stack up against other options?

  • Ollama: A wrapper around llama.cpp. It simplifies model management (ollama run llama3) but hides the granular control of -ngl and sampling parameters.
  • vLLM: Optimized for high-throughput serving on Linux/NVIDIA. Much faster for multiple concurrent users but requires significantly more VRAM and lacks the CPU-only flexibility of llama.cpp.
  • ExLlamaV2: Specifically optimized for GPUs. Faster than llama.cpp on NVIDIA hardware but cannot run on CPUs.

Security Considerations

Local LLMs are inherently more secure than cloud APIs, but they are not risk-free. Be cautious of the models you download from public repositories. While GGUF is a data format and not an executable, always verify the hash of the model file. When using local models with agents, ensure your agent has a strict deny-list for sensitive directories like .ssh or .env to prevent the model from accidentally reading credentials into its context window. You can read more about this approach in the AZMX security documentation.

Summary Checklist

  1. Clone llama.cpp and build with the appropriate hardware flags (Metal/CUDA).
  2. Download a Q4_K_M GGUF model from Hugging Face.
  3. Execute via llama-cli for quick tests or llama-server for API integration.
  4. Offload as many layers as possible to the GPU using -ngl.
  5. Connect the API endpoint to a sovereign agent platform for actual coding work.

One window. The whole loop.