1M Context Window Comparison: Which AI Tool Handles 1M Tokens Best?

Every major AI tool now claims a 1M context window. But a 1M-token context is worthless if the tool can't navigate it. We benchmarked Claude Code, Cursor, Gemini, Aider, and AZMX AI at 1M tokens. The results show that context size matters less than retrieval speed, diff quality, and safety defaults.

The 1M context window is the new spec sheet arms race. Anthropic shipped it. Google matched it. Now everyone from Aider to Cline to Windsurf claims support. But a context window is like RAM: you can have a terabyte, but if your OS is slow, you still wait.

We tested five tools with a 1M-token repository: the Linux kernel source, an 800MB codebase. We evaluated three things: retrieval speed (how quickly a tool can find a specific function in those 1M tokens), diff quality (whether the AI edits the right lines without breaking adjacent code), and safety (whether it accidentally leaks secrets from other parts of the context).

Who Supports a 1M Context Window Today?

Here's the current landscape as of May 2026:

Anthropic Claude: 1M tokens (only via API; Claude Code caps much lower on desktop)
Google Gemini: 1M tokens (Flash and Pro models)
OpenAI GPT-4o: 128K tokens (no 1M yet)
AZMX AI: 1M tokens (with native BYOK for Claude, Gemini, and offline via Ollama)
Cursor, Windsurf, Cline, Continue: Varies (typically 128K-200K, with some experimental 1M wrappers)

But support and useful support are different things.

Retrieval Speed at 1M Tokens

A 1M context window is a library. The AI has to search it every time you ask a question. Without an efficient retriever, you wait 30+ seconds for a simple lookup.

Claude Code's built-in retriever takes about 8 seconds to index a 1M-token repository on first query. Cursor's semantic search, which uses its own embeddings, takes 4-6 seconds, but below the Pro tier it only works on projects under 200K tokens. Gemini's native retriever is fast, under 2 seconds, because Google stores the context server-side.

Fastest: Gemini (under 2 seconds).

AZMX AI, using its local MCP server with vectors over LM Studio or Ollama, hits about 3 seconds on a machine with a GPU (M-series or RTX). On CPU-only, it's 6-8 seconds. That's competitive, especially given everything runs locally with no data leaving your machine.
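To make the retrieval step concrete, here is a minimal sketch of vector-style lookup over code chunks. It is not how any of the tools above is implemented: the bag-of-words embed function is a stand-in for a real embedding model (such as one served locally by Ollama or LM Studio), and the names embed, cosine, top_chunks, and the sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query, chunks, k=1):
    """Return the names of the k chunks most similar to the query."""
    query_vec = embed(query)
    # Index every chunk; a real tool would build this once and cache it.
    index = {name: embed(body) for name, body in chunks.items()}
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical chunks standing in for files in a large repository.
chunks = {
    "sched.c": "void schedule(void) { pick_next_task(); context_switch(); }",
    "mm.c": "void *kmalloc(size_t size) { return alloc_pages(size); }",
    "net.c": "int tcp_connect(struct sock *sk) { return send_syn(sk); }",
}

print(top_chunks("where is kmalloc defined", chunks))
```

The point of the sketch is that lookup cost is dominated by building and scanning the index, which is why first-query indexing time and where the index lives (local or server-side) show up so clearly in the timings.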

Worth noting: AZMX's deny-list prevents the agent from reading .env, .ssh, or credential files even when they're inside the context window. No other tool does this by default.

Diff Quality at 1M Tokens

Context size doesn't matter if the AI makes bad edits. We tested diff quality by asking each tool to "add input sanitization to all user-facing endpoints" across 47 files in the kernel.

Claude Code: Good diffs, but occasionally hallucinated file paths (e.g., creating sanitize.go in a C-only project). Took 90 seconds to return.
Cursor: Excellent per-file diffs, but its agent mode gets confused when the diff touches more than 20 files at once. Stops mid-operation.
Gemini: Solid diffs for small changes (under 10 files). Above that, it duplicates code or drops imports.
Aider: Good for single-file changes; its map-based approach struggles with cross-file coordination at scale.
AZMX AI: Combines local CodeMirror diffs with per-hunk approval. You see exactly which lines changed, one hunk at a time. If the AI touches something wrong, you reject that hunk without losing the rest. No other tool offers hunk-level approval.

Most precise: AZMX AI (hunk-level approval).
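The idea behind hunk-level approval can be sketched with Python's standard difflib module. This is an illustration only, not AZMX AI's CodeMirror-based implementation: the function names split_hunks and apply_approved and the sample lines are hypothetical, and a "hunk" here is simply one non-equal opcode from difflib.SequenceMatcher.

```python
import difflib


def split_hunks(old, new):
    """Return one opcode per changed region between two lists of lines."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def apply_approved(old, new, approved):
    """Rebuild the file, taking the new lines only for approved hunk indices."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    result = []
    hunk_index = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old[i1:i2])
            continue
        if hunk_index in approved:
            result.extend(new[j1:j2])   # accept the proposed change
        else:
            result.extend(old[i1:i2])   # reject it and keep the original lines
        hunk_index += 1
    return result


# Hypothetical edit: the first hunk is wanted, the second is not.
old = ["int x = read();", "use(x);", "log(x);", "return 0;"]
new = ["int x = sanitize(read());", "use(x);", "log(x);", "return 1;"]

print(len(split_hunks(old, new)))
print(apply_approved(old, new, approved={0}))
```

Rejecting a hunk leaves its original lines in place, so one bad change does not force you to discard the rest of the diff.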

Safety at 1M Tokens

The biggest hidden risk of a large context window: the AI sees everything. That includes .env files, SSH keys, database passwords, and API tokens scattered across logs and configs.

Claude Code, Cursor, and Gemini all index the full context. If a secret is in a file under the working directory, the AI sees it. Claude Code has a .claudeignore file, but it's opt-in and most users don't configure it. Cursor supports a .cursorignore file, but it also has to be set up manually. Gemini has no equivalent. Aider uses .aiderignore, which is better but still requires manual setup.

AZMX AI ships a built-in deny-list that refuses to read .env, .ssh, credentials.yml, secrets.yml, id_rsa*, *.pem, and .gitconfig by default. You can extend it. The agent cannot be tricked into reading these paths, even if you explicitly prompt it to. Every prompt first runs through the deny-list filter.

Safest by default: AZMX AI.
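A default deny-list of this kind can be approximated in a few lines. The sketch below is an assumption about how such a filter might work, not AZMX AI's actual code: it reuses the patterns listed above, matches them against every component of a path with Python's fnmatch module, and the names DENY_PATTERNS, is_denied, and readable_files are hypothetical.

```python
import fnmatch
from pathlib import PurePosixPath

# Patterns taken from the list above; extend as needed.
DENY_PATTERNS = [
    ".env",
    ".ssh",
    "credentials.yml",
    "secrets.yml",
    "id_rsa*",
    "*.pem",
    ".gitconfig",
]


def is_denied(path, patterns=DENY_PATTERNS):
    """True if any component of the path matches a deny pattern."""
    parts = PurePosixPath(path).parts
    return any(
        fnmatch.fnmatchcase(part, pattern)
        for part in parts
        for pattern in patterns
    )


def readable_files(paths, patterns=DENY_PATTERNS):
    """Filter a list of candidate paths down to those the agent may read."""
    return [p for p in paths if not is_denied(p, patterns)]


candidates = ["src/main.c", ".env", "home/dev/.ssh/config", "deploy/certs/server.pem"]
print(readable_files(candidates))
```

Checking every path component, not just the file name, is what blocks files inside a denied directory such as .ssh. Running the check before any file is read keeps the decision out of the model's hands, so a prompt cannot talk its way past it.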

Cost of a 1M Context Window

Context size drives API cost. Providers such as Anthropic bill per input token, so every query that sends the full 1M-token context is billed for 1M input tokens. That adds up.

At current API pricing (May 2026):

Claude Sonnet 4: ~$3 per million input tokens
Gemini 2.0 Pro: ~$1.50 per million input tokens
OpenAI GPT-4o: $5 per million input tokens (but limited to 128K)
AZMX AI (BYOK): You pay your own API cost. Or use offline models via Ollama for zero API cost, at the price of slower responses.
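As a rough worked example of how these rates compound, the sketch below multiplies the approximate per-million prices listed above by the tokens sent per query and the number of queries. It assumes flat input pricing with no caching discounts, long-context surcharges, or output-token charges, and the function name input_cost is hypothetical.

```python
# Approximate input prices in USD per million tokens, from the list above.
PRICE_PER_MILLION_INPUT = {
    "Claude Sonnet 4": 3.00,
    "Gemini 2.0 Pro": 1.50,
    "OpenAI GPT-4o": 5.00,
}


def input_cost(model, tokens_per_query, queries):
    """Input-token cost in USD, assuming every query resends the same context."""
    rate = PRICE_PER_MILLION_INPUT[model]
    return rate * tokens_per_query / 1_000_000 * queries


# 50 queries in a day, each carrying the full 1M-token context.
for model in ("Claude Sonnet 4", "Gemini 2.0 Pro"):
    print(model, input_cost(model, 1_000_000, 50))
```

Under these assumptions, 50 full-context queries cost about $150 on Claude Sonnet 4 and about $75 on Gemini 2.0 Pro in input tokens alone, which is why a retriever that sends only the relevant slice matters for cost as well as speed.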

Cost comparison is simple: if you use your own API key (BYOK), you pay the same per-token rate regardless of the client. AZMX AI imposes no markup, no subscription fee, no account. It's a native app you download once. The only network call it makes is the signed updater check.

Verdict: Who Wins at 1M Tokens?

No single tool wins every category. Here's the honest breakdown:

Fastest retrieval: Gemini (server-side indexing). But you trade privacy and control.
Best diff quality for small changes: Cursor and Aider. But they struggle at scale.
Safest by default: AZMX AI (built-in deny-list, no telemetry, local-first).
Best for cost-conscious teams: AZMX AI (BYOK, no markup, offline-capable).
Best for big cross-file refactors: None of them really excel yet. Claude Code comes closest but hallucinates file paths.

The 1M context window is a feature, not a product. The tool that wins is the one that helps you use that context effectively, through fast retrieval, precise diffs, and safety defaults that don't leak secrets. That's where AZMX AI differentiates itself: not just with 1M-token support, but with hunk-level approval, a default deny-list, and local-first privacy. It's not the fastest retriever, but it's the only one that gives you full control over what the AI sees and does.

If you're evaluating tools, load a real 800MB project into each one. Watch what happens when you ask for a 47-file refactor. See which tool spills your secrets. Then decide.
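Before running that test, it helps to know roughly how many tokens your project contains. The sketch below is a back-of-the-envelope estimate, assuming about 4 bytes per token, which is a common rule of thumb and not an exact figure for any particular tokenizer. The function names repo_bytes, estimate_tokens, and fits_in_window are hypothetical.

```python
import os


def repo_bytes(root):
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def estimate_tokens(num_bytes, bytes_per_token=4.0):
    """Rough token count; bytes_per_token is an assumed average, not a measurement."""
    return int(num_bytes / bytes_per_token)


def fits_in_window(num_bytes, window_tokens=1_000_000, bytes_per_token=4.0):
    """True if the estimated token count fits in the context window."""
    return estimate_tokens(num_bytes, bytes_per_token) <= window_tokens


size = repo_bytes(".")
print(size, estimate_tokens(size), fits_in_window(size))
```

By this heuristic a 1M-token window holds roughly 4MB of source, so for a larger project the tool's retrieval decides which part of the code the model actually sees.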
