AZMX AI

Engineering · 2026-05-27 · 8 min read

Observability with OpenTelemetry for LLMs

Moving beyond simple logging to structured tracing for non-deterministic AI agent loops.

Standard logging fails when debugging agentic workflows because a single user request can trigger dozens of recursive LLM calls, tool executions, and state updates. OpenTelemetry (OTel) provides a vendor-neutral framework to trace these spans, allowing engineers to visualize the exact path an agent took before it failed or hallucinated.

The Problem with LLM Black Boxes

Traditional APM tools track latency and error rates. However, LLM applications introduce a new failure mode: the semantic error. An agent might return a 200 OK response while providing a logically incorrect answer or entering an infinite loop of tool calls. Without structured tracing, debugging these issues requires manually parsing massive JSON logs to reconstruct the sequence of events.
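
One way to surface the tool-call loop described above is to scan a finished trace for repeated, identical tool invocations. The sketch below is illustrative only: the span dictionaries, the `tool.name` and `tool.args` keys, and the `max_repeats` threshold are assumptions made for this example, not part of any OTel API.

```python
import json
from collections import Counter


def find_repeated_tool_calls(spans, max_repeats=3):
    """Return (tool, args) pairs that occur more than max_repeats times in one trace.

    `spans` is a hypothetical list of dicts exported from a tracing backend;
    each tool-call span is assumed to carry "tool.name" and "tool.args".
    """
    counts = Counter()
    for span in spans:
        attrs = span.get("attributes", {})
        if "tool.name" not in attrs:
            continue  # not a tool-call span (e.g. an LLM completion)
        # Serialize args with sorted keys so identical calls compare equal.
        args = json.dumps(attrs.get("tool.args", {}), sort_keys=True)
        counts[(attrs["tool.name"], args)] += 1
    return [key for key, n in counts.items() if n > max_repeats]


# Five identical searches plus one LLM span: the search is flagged as a loop.
trace = [{"attributes": {"tool.name": "search", "tool.args": {"q": "otel"}}}] * 5
trace.append({"attributes": {"gen_ai.request.model": "gpt-4o"}})
print(find_repeated_tool_calls(trace))
```

A check like this only works when each tool call is recorded as its own span with its arguments attached, which is the structure the rest of this article builds toward.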

Why OpenTelemetry?

OpenTelemetry for LLMs standardizes how we record inputs, outputs, and metadata. Instead of custom logs, OTel uses spans. A single request becomes a trace, and every LLM call or MCP tool execution becomes a child span. This allows you to see exactly where a prompt failed or which tool returned the malformed data that confused the model.

Implementing OTel Semantic Conventions

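To get value from OTel, follow the GenAI semantic conventions. Shared attribute names keep your traces portable across backends like Honeycomb, Jaeger, or Arize Phoenix. The conventions are still evolving, so check the attribute names against the version your backend supports. Key attributes to track include:

  • gen_ai.request.model: The specific model version (e.g., claude-3-5-sonnet).
  • gen_ai.input.messages: The prompt messages sent to the model (opt-in, since they may contain sensitive data).
  • gen_ai.output.messages: The messages returned by the model (also opt-in).
  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens: Exact token counts for cost analysis.

# Conceptual OTel span for an LLM call (Python).
# `client` and `user_prompt` are placeholders for your own LLM client and input.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# The context manager ends the span on exit and records any exception raised inside it.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # The conventions define a structured message format; a plain string is used here for brevity.
    span.set_attribute("gen_ai.input.messages", user_prompt)
    response = client.generate(user_prompt)
    span.set_attribute("gen_ai.output.messages", response.text)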
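
Once token counts are on the span, cost analysis is simple arithmetic. The following is a minimal sketch, assuming the counts are recorded under `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`; the function name, the `example-model` name, and the prices are hypothetical placeholders, not real rates.

```python
def estimate_cost_usd(span_attributes, price_per_million):
    """Estimate the cost of one LLM span from its token-usage attributes.

    `price_per_million` maps a model name to (input_price, output_price)
    in USD per million tokens. The prices below are placeholders.
    """
    model = span_attributes["gen_ai.request.model"]
    input_price, output_price = price_per_million[model]
    input_tokens = span_attributes.get("gen_ai.usage.input_tokens", 0)
    output_tokens = span_attributes.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical price table: replace with your provider's published rates.
PRICES = {"example-model": (2.0, 8.0)}

attrs = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 1_200,
    "gen_ai.usage.output_tokens": 300,
}
print(f"{estimate_cost_usd(attrs, PRICES):.6f}")
```

Summing this value over all child spans of a trace gives a per-request cost, which is usually more useful than a per-call figure for agent loops.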

Tracing Agentic Loops and Sub-Agents

Agentic workflows often involve a "planner" and several "executors." When these agents communicate, you need a trace ID that persists across the entire chain. If a planner delegates a task to a sub-agent, the sub-agent's span must be linked to the parent trace.

This is particularly critical when using the Model Context Protocol (MCP). When an agent calls an MCP tool over stdio or HTTP, the trace should extend into the tool's execution. This reveals if the bottleneck is the LLM's reasoning or the tool's data retrieval latency.
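
Linking a sub-agent or tool to the parent trace comes down to passing the trace context along with the request. The sketch below hand-rolls the W3C `traceparent` format (`version-traceid-spanid-flags`) using only the standard library to show what gets propagated. Carrying the value in the request's `_meta` field is an assumption made for this example, as are the function names; in practice the OTel SDK's propagators would normally do this injection and extraction for you.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent):
    """Keep the parent's trace ID and flags, but mint a new span ID."""
    match = TRACEPARENT_RE.fullmatch(parent)
    if match is None:
        # Invalid or missing context: start a fresh trace rather than fail.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Planner side: attach the context to the outgoing tool request (hypothetical shape).
planner_ctx = new_traceparent()
request = {
    "method": "tools/call",
    "params": {"name": "search", "_meta": {"traceparent": planner_ctx}},
}

# Tool side: continue the same trace instead of starting a new one.
tool_ctx = child_traceparent(request["params"]["_meta"]["traceparent"])
print(planner_ctx.split("-")[1] == tool_ctx.split("-")[1])  # same trace ID
```

Because the trace ID survives the hop, the tool's spans show up under the planner's trace, and the time spent in the tool can be separated from the time spent in the model.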

Comparing Observability Approaches

Many developers start with built-in logging in frameworks like LangChain or LlamaIndex. While useful for prototyping, this logging is often tightly coupled to the framework or to a proprietary hosted service. Tools like Cursor or GitHub Copilot provide an integrated experience, but they abstract the telemetry away from the developer. For those building sovereign agent platforms, owning the telemetry pipeline is non-negotiable.

AZMX AI takes a different approach to visibility. Instead of opaque cloud logs, it focuses on local transparency. By using a real PTY terminal and an approval-gated edit system, the user acts as the final observability layer. AZMX does not ship as a managed OTel backend. However, because its architecture supports BYOK (bring your own key) and local MCP servers, developers can wrap their own MCP tools in OTel instrumentation to monitor how the agent interacts with their system.

Reducing Noise in LLM Traces

The primary challenge with OTel for LLMs is the volume of data. Tracing every single token or prompt in a high-traffic app is expensive and noisy. Use these strategies to manage the load:

  1. Head-based Sampling: Decide when the trace starts, and only trace a percentage of requests (e.g., 5%) during normal operation.
  2. Tail-based Sampling: Decide after the trace completes, and keep every trace that results in an error or exceeds a specific latency threshold (e.g., > 5 seconds).
  3. Attribute Filtering: Strip large context windows from traces in production, keeping only the final prompt and the response.
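
The three strategies above can be sketched as plain functions. This is a simplified illustration, not the OTel SDK's implementation: the function names, the trace dictionary shape, and the `max_chars` limit are assumptions, and the filter truncates oversized values rather than dropping them. In a real deployment, head sampling is usually configured through the SDK's sampler and tail sampling is usually done in a collector.

```python
def head_sample(trace_id_hex, ratio=0.05):
    """Head-based: decide up front, deterministically from the trace ID.

    Every service that sees the same trace ID reaches the same decision.
    """
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < int(ratio * (1 << 64))


def tail_keep(trace, latency_threshold_s=5.0):
    """Tail-based: decide after the trace finishes, once the outcome is known."""
    return trace["error"] or trace["duration_s"] > latency_threshold_s


def strip_large_attributes(attributes, max_chars=2000):
    """Attribute filtering: truncate oversized string values such as long contexts."""
    filtered = {}
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > max_chars:
            value = value[:max_chars] + "...[truncated]"
        filtered[key] = value
    return filtered


print(head_sample("0" * 31 + "1"))                      # low ID value: sampled
print(head_sample("f" * 32))                            # high ID value: dropped
print(tail_keep({"error": False, "duration_s": 7.2}))   # slow trace: kept
print(len(strip_large_attributes({"ctx": "x" * 5000})["ctx"]))
```

Head and tail sampling can be combined: a low head-sampling ratio controls baseline volume, while a tail rule ensures that failed or slow agent runs are never discarded.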

Conclusion

OpenTelemetry is the most practical vendor-neutral path toward professional LLM operations. By treating AI calls as spans in a distributed trace rather than isolated logs, you can move from guessing why an agent failed to seeing exactly which LLM call or tool execution went wrong. For those building these systems, prioritizing a security-first architecture and vendor-neutral telemetry helps keep your agentic stack maintainable as models evolve.
