
Your Datadog dashboard is green. Your API latency sits comfortably below 200ms. Your error rates are negligible. And yet your AI-powered feature is producing wildly inconsistent results, haemorrhaging tokens on redundant calls, and occasionally hallucinating data that sends customers down entirely wrong paths.

Welcome to the observability gap — the growing chasm between what traditional Application Performance Monitoring (APM) tells you and what you actually need to know about your AI workloads in production.

TL;DR

  • Traditional APM tools (Datadog, New Relic, Grafana) track infrastructure metrics but miss AI-specific behaviour like hallucination rates, prompt drift, and semantic quality degradation
  • The LLM observability market hit $2.69 billion in 2026, with Gartner predicting 50% of GenAI deployments will require dedicated observability by 2028
  • AI-native tracing tools (Langfuse, LangSmith, Arize) capture prompt-response pairs, token economics, and quality evaluations that infrastructure monitors cannot
  • Production AI systems need three observability layers: infrastructure metrics, AI-native tracing, and continuous quality evaluation
  • Teams that bolt LLM monitoring onto existing APM without adding quality evaluation are monitoring their infrastructure, not their AI

The Problem with Treating LLMs Like Any Other API

When your application calls a REST API, the contract is clear: you send a request, you get a structured response, and success or failure is binary. A 200 means it worked. A 500 means it didn’t. Your monitoring tools were built for this world.

LLM calls break every one of these assumptions. A successful HTTP response from your model provider tells you almost nothing about whether the output was actually good. The model can return a perfectly formatted JSON response that contains fabricated data. It can produce a grammatically flawless summary that misses the core point entirely. It can generate code that compiles but introduces a subtle security vulnerability.

Traditional APM tools will tell you the call took 2.3 seconds and consumed 4,200 tokens. They will not tell you the response quality dropped by 15% after your last prompt template change, or that your retrieval-augmented generation (RAG) pipeline is pulling irrelevant context chunks 30% of the time.

Three Layers of AI Observability

Production AI systems that actually work require observability at three distinct layers. Most teams only have the first.

Layer 1: Infrastructure Metrics (What You Already Have)

This is your existing APM stack — latency, error rates, throughput, and uptime. For LLM workloads, it extends to token consumption, model endpoint availability, and API rate limits. If your AI feature goes down because your provider is having an outage, your existing Datadog or Grafana setup will catch it. This layer is necessary but nowhere near sufficient.

Layer 2: AI-Native Tracing (What Most Teams Are Missing)

This is where purpose-built LLM observability tools earn their keep. AI-native tracing captures the full prompt-response lifecycle: the system prompt, user input, any RAG context injected, the model’s reasoning chain, tool calls made, and the final output. Crucially, it preserves the relationship between these elements across multi-step agent workflows.

Tools like Langfuse, LangSmith, and Arize AI provide this layer. They let you inspect individual traces, spot prompt template regressions, track token economics per feature, and understand why a particular interaction produced a poor result — not just that it did.

The difference matters enormously for debugging. When a customer reports that your AI assistant gave wrong advice, infrastructure metrics will show you a successful API call with normal latency. AI-native tracing will show you the exact prompt that was constructed, the context documents that were retrieved, and the model’s output — letting you pinpoint whether the issue was bad retrieval, a prompt template bug, or a model limitation.
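To make the idea concrete, here is a minimal sketch of the kind of record a tracing layer might store. It is not the data model of Langfuse, LangSmith, or Arize; the Span and Trace classes and all of their field names are hypothetical, chosen only to show how a parent link preserves the relationship between retrieval, generation, and tool steps.

```python
from __future__ import annotations

import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    """One step inside a trace: a retrieval, a model call, or a tool call."""
    name: str
    kind: str                      # "retrieval" | "generation" | "tool" (assumed labels)
    input: str
    output: str
    parent_id: str | None = None   # links steps across a multi-step workflow
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    metadata: dict = field(default_factory=dict)


@dataclass
class Trace:
    """Everything needed to reconstruct one user interaction."""
    user_input: str
    system_prompt: str
    spans: list[Span] = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def add(self, span: Span) -> Span:
        self.spans.append(span)
        return span

    def to_json(self) -> str:
        # asdict() recurses into the nested Span dataclasses.
        return json.dumps(asdict(self))


# Usage: record the retrieval step, then the generation that depended on it.
trace = Trace(user_input="Can I get a refund?",
              system_prompt="You are a support assistant.")
retrieval = trace.add(Span("fetch_policy_docs", "retrieval",
                           "refund", "doc-17: Refunds within 30 days"))
trace.add(Span("answer", "generation", "prompt built from doc-17",
               "Yes, within 30 days.", parent_id=retrieval.span_id))
```

With a record like this, the debugging question becomes a lookup: find the generation span, follow parent_id to the retrieval span, and check whether the retrieved document actually supported the answer.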

Layer 3: Continuous Quality Evaluation (What Separates Good from Great)

This is the layer that most organisations haven’t even considered, and it’s arguably the most important. Continuous evaluation means running automated quality checks against your AI outputs in production: not only in your test suite, but on every interaction or on a representative sample of them.

This includes metrics like:

  • Faithfulness: Does the output actually reflect the source documents, or is the model hallucinating?
  • Relevance: Does the response address what the user actually asked?
  • Semantic drift: Has the quality of outputs changed over time, perhaps due to model updates or shifting input patterns?
  • Toxicity and safety: Are outputs staying within your defined guardrails?
  • Cost efficiency: Are you spending tokens on needlessly verbose responses or redundant tool calls?

Without this layer, you are flying blind. Your AI feature could be gradually degrading in quality for weeks before a customer complaint triggers an investigation.
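As a rough illustration of how sampling and drift detection could fit together, the sketch below picks a stable fraction of traces for evaluation and compares recent quality scores against a baseline. The function names, the 0.05 threshold, and the assumption that scores lie between 0 and 1 are ours for illustration; how the scores are produced (a faithfulness check, an LLM-as-judge, and so on) is left out.

```python
import hashlib
from statistics import mean


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select a stable fraction of traces for evaluation."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


def drift_detected(baseline: list[float], recent: list[float],
                   max_drop: float = 0.05) -> bool:
    """Flag when the recent mean score falls more than max_drop below baseline."""
    if not baseline or not recent:
        return False  # not enough data to compare
    return mean(baseline) - mean(recent) > max_drop


# Usage: scores from last month versus scores from the past day.
alert = drift_detected(baseline=[0.91, 0.89, 0.90], recent=[0.80, 0.82])
```

Hashing the trace ID rather than calling a random number generator means the same trace is always either in or out of the sample, which keeps re-runs comparable. A comparison of means is deliberately crude; a real pipeline would probably want minimum sample sizes and a proper statistical test before paging anyone.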

The Hidden Cost Problem

One of the most compelling arguments for dedicated LLM observability is cost governance. Unlike traditional compute resources, where costs scale relatively predictably with traffic, LLM costs can spike dramatically based on prompt length, context window usage, and retry patterns.

We have seen production systems where a single poorly constructed prompt template was responsible for 40% of the total AI spend — padding every request with unnecessary context that inflated token counts without improving output quality. Traditional monitoring showed healthy throughput numbers. AI-native observability revealed the waste immediately.

The best LLM observability platforms now include token-level cost attribution, letting you track spend per feature, per user segment, and per prompt template version. Once your monthly AI bill runs into the thousands, this granularity is the difference between sustainable scaling and runaway costs.
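The arithmetic behind cost attribution is simple enough to sketch. The example below assumes each model call is logged with its feature name, prompt template version, model, and token counts. The model names and per-1,000-token prices are placeholders, not real provider rates, and the Usage record is a hypothetical shape rather than any platform's schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices per 1,000 tokens as (input, output); not real provider rates.
PRICE_PER_1K = {"small-model": (0.0005, 0.0015), "large-model": (0.01, 0.03)}


@dataclass(frozen=True)
class Usage:
    feature: str
    template_version: str
    model: str
    input_tokens: int
    output_tokens: int


def cost(u: Usage) -> float:
    """Cost of one call from its token counts and the price table."""
    in_price, out_price = PRICE_PER_1K[u.model]
    return u.input_tokens / 1000 * in_price + u.output_tokens / 1000 * out_price


def spend_by(records, key):
    """Total spend grouped by whatever key(record) returns."""
    totals = defaultdict(float)
    for r in records:
        totals[key(r)] += cost(r)
    return dict(totals)


def over_budget(totals, budgets):
    """Names whose spend exceeds their budget; names without a budget never alert."""
    return sorted(name for name, spent in totals.items()
                  if spent > budgets.get(name, float("inf")))


records = [
    Usage("search", "v1", "large-model", 1000, 1000),
    Usage("search", "v2", "large-model", 4000, 1000),  # padded template
    Usage("chat", "v1", "small-model", 2000, 2000),
]
per_feature = spend_by(records, key=lambda r: r.feature)
per_template = spend_by(records, key=lambda r: (r.feature, r.template_version))
alerts = over_budget(per_feature, {"search": 0.10, "chat": 1.00})
```

Grouping by (feature, template_version) is what would surface a case like the padded template described above: the v2 template costs noticeably more per call than v1 for the same output size.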

What a Practical LLM Observability Stack Looks Like

For teams building AI-powered features into existing products — which describes most of our clients at REPTILEHAUS — here is what we recommend:

  1. Keep your existing APM for infrastructure metrics. Datadog, Grafana, New Relic — whatever you use, keep using it. It handles the plumbing.
  2. Add an AI-native tracing layer. Langfuse (open-source, self-hostable) is our preferred choice for teams that want control over their data. LangSmith works well if you are already in the LangChain ecosystem. Arize is strong for teams needing production ML monitoring alongside LLM tracing.
  3. Implement continuous evaluation. Start simple — even basic checks like output length bounds, JSON schema validation, and keyword presence can catch regressions. Graduate to LLM-as-judge evaluations for semantic quality once your volume justifies it.
  4. Set up cost alerting. Token consumption should have budget alerts just like your AWS spend. Per-feature attribution is ideal; per-model-provider is the minimum.
  5. Instrument your RAG pipeline separately. If you are doing retrieval-augmented generation, track retrieval quality independently from generation quality. A bad retrieval step poisons everything downstream.
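The "start simple" checks from step 3 can be sketched in a few lines using only the standard library. The function name, the failure labels, and the default limit of 2,000 characters are illustrative assumptions; the key check here is a stand-in for full JSON Schema validation, which would need a dedicated library.

```python
from __future__ import annotations

import json


def check_output(raw: str, *, required_keys: set[str], max_chars: int = 2000,
                 required_keywords: tuple[str, ...] = ()) -> list[str]:
    """Return a list of failure reasons; an empty list means every check passed."""
    failures = []

    # 1. Output length bounds.
    if not 0 < len(raw) <= max_chars:
        failures.append("length_out_of_bounds")

    # 2. Structural validity: must parse as a JSON object with the expected keys.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["invalid_json"]
    if not isinstance(parsed, dict):
        return failures + ["not_a_json_object"]
    missing = required_keys - set(parsed)
    if missing:
        failures.append("missing_keys:" + ",".join(sorted(missing)))

    # 3. Keyword presence (case-insensitive substring match).
    text = raw.lower()
    for kw in required_keywords:
        if kw.lower() not in text:
            failures.append(f"missing_keyword:{kw}")
    return failures


# Usage: a well-formed answer passes, a truncated one does not.
ok = check_output('{"answer": "Refunds within 30 days", "sources": ["doc-17"]}',
                  required_keys={"answer", "sources"},
                  required_keywords=("refund",))
bad = check_output('{"answer": "Refunds within', required_keys={"answer"})
```

Returning reasons rather than a boolean makes the results easy to count per failure type, which is usually the first chart worth putting on a dashboard.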

The Gateway Approach

An increasingly popular pattern is the AI gateway — a lightweight proxy that sits between your application and your model providers. Tools like Helicone, Portkey, and LiteLLM act as this layer, adding observability, caching, fallback routing, and cost tracking with minimal code changes.

The gateway approach is particularly attractive for teams that use multiple model providers or want to implement model routing strategies (sending simple queries to cheaper models and complex ones to more capable models). The observability comes almost for free as a side effect of the routing layer.

That said, gateways add a network hop and a dependency. For latency-sensitive applications, evaluate whether the observability benefits justify the added milliseconds.
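The routing idea described above can be sketched without any particular gateway product. This is not how Helicone, Portkey, or LiteLLM work internally; the model names, the word-count heuristic, and the call_model callback are all hypothetical, and a real router would use a better complexity signal than prompt length.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical model names; a real gateway would map these to provider clients.
CHEAP_MODEL = "small-model"
CAPABLE_MODEL = "large-model"


def choose_model(prompt: str, max_simple_words: int = 50) -> str:
    """Crude complexity heuristic: short prompts go to the cheaper model."""
    return CHEAP_MODEL if len(prompt.split()) <= max_simple_words else CAPABLE_MODEL


def route(prompt: str, call_model: Callable[[str, str], str],
          log: list[dict]) -> str:
    """Call the chosen model, fall back to the other on error, log every attempt."""
    first = choose_model(prompt)
    fallback = CAPABLE_MODEL if first == CHEAP_MODEL else CHEAP_MODEL
    for model in (first, fallback):
        try:
            output = call_model(model, prompt)
            log.append({"model": model, "ok": True})
            return output
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            log.append({"model": model, "ok": False, "error": repr(exc)})
    raise RuntimeError("all models failed")


# Usage with a stand-in provider call where the cheap model is down.
def fake_call(model: str, prompt: str) -> str:
    if model == CHEAP_MODEL:
        raise TimeoutError("provider unavailable")
    return f"{model}: ok"


attempts: list[dict] = []
answer = route("What is your refund policy?", fake_call, attempts)
```

The log list is the point of the sketch: because every request already passes through route, recording which model served it and which attempts failed costs one extra line per branch.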

Mistakes We See Teams Making

Logging everything, evaluating nothing. Capturing every prompt and response is easy. Doing something meaningful with that data requires deliberate evaluation pipelines. Terabytes of stored traces are worthless without quality metrics.

Treating LLM observability as a DevOps problem. Your infrastructure team can manage the tooling, but defining what “good output” looks like requires product and domain expertise. Quality evaluation criteria must come from the people who understand the use case.

Ignoring prompt versioning. When your AI feature’s quality drops, the first question should be: “What changed?” Without versioned prompts linked to your observability traces, answering that question means digging through git history and correlating timestamps manually.
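One lightweight way to link prompts to traces, sketched below, is to derive a version ID from the template text itself and attach it to every rendered prompt. The function names and the 12-character hash length are our own choices for illustration; tracing platforms typically offer their own prompt management features, which this does not attempt to reproduce.

```python
import hashlib
from string import Template


def template_version(template_text: str) -> str:
    """Short content hash: any edit to the template yields a new version ID."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:12]


def render_with_version(template_text: str, **variables: str) -> dict:
    """Render the prompt and return it alongside the version ID for the trace."""
    return {
        "prompt": Template(template_text).substitute(**variables),
        "prompt_version": template_version(template_text),
    }


# Usage: store prompt_version on the trace next to the model output.
SUPPORT_TEMPLATE = "Answer using only: $context\nQuestion: $question"
rendered = render_with_version(SUPPORT_TEMPLATE,
                               context="Refunds within 30 days.",
                               question="Can I get a refund?")
```

Because the ID comes from the content, nobody has to remember to bump a version number. When quality drops, grouping evaluation scores by prompt_version shows immediately whether the drop lines up with a template change.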

Waiting for scale. The best time to instrument your AI stack is before you have problems, not after a production incident. The patterns you establish at low volume become the guardrails that protect you at scale.

Where This Is Heading

Gartner’s prediction that 50% of GenAI deployments will require dedicated observability by 2028 — up from roughly 15% today — suggests the market has barely begun to mature. We are seeing rapid consolidation, with traditional APM vendors acquiring or building AI-specific capabilities, while AI-native tools expand into broader observability.

The teams that invest in proper LLM observability now will have a significant advantage: they will understand their AI systems deeply enough to optimise cost, improve quality, and debug issues quickly. The teams that treat their LLM calls like any other API endpoint will spend their time firefighting.

Need Help Building Your AI Observability Stack?

At REPTILEHAUS, we help development teams integrate AI capabilities into production applications — including the observability infrastructure that keeps them reliable and cost-effective. If you are building AI-powered features and want to ensure they are properly instrumented from day one, get in touch.

📷 Photo by Luke Chesser on Unsplash