
There is a quiet crisis unfolding in engineering budgets across Dublin and beyond. Your monitoring bill — the cost of simply knowing what your application is doing — is growing faster than your infrastructure, faster than your revenue, and sometimes faster than your headcount.

For many growing teams, observability has become one of the largest line items after compute and salaries. And the worst part? Most of that spend is waste.

TL;DR

  • Observability costs are growing 30–50% year on year for most teams, with mid-sized companies spending €50,000–€150,000 annually on monitoring alone
  • Log ingestion is the single biggest cost driver — a single noisy microservice can add thousands to your monthly bill overnight
  • OpenTelemetry has matured into a production-ready vendor abstraction layer, giving teams the freedom to move data between backends without re-instrumenting
  • A tiered observability strategy (hot, warm, and cold storage) can cut log-related spend by 40–60% without sacrificing the data you actually need during incidents
  • The split-vendor model (keep your primary tool for what it does best, migrate everything else) is now the pragmatic default for cost-conscious teams

How We Got Here

The observability market exploded alongside microservices. When your application was a single Rails or Django monolith, monitoring meant Nagios checks and a log file. When you decomposed that monolith into dozens of services across Kubernetes, you suddenly needed distributed tracing, structured logging, metrics aggregation, and error tracking — all correlated, all searchable, all retained.

Vendors like Datadog, New Relic, and Splunk were more than happy to oblige. They built genuinely excellent platforms. They also built pricing models that scale super-linearly with your infrastructure.

Here is the uncomfortable reality in 2026: observability bills are growing 30–50% year on year for most teams, and mid-sized companies routinely spend €50,000–€150,000 annually on full-stack observability. Enterprise deployments easily exceed €1 million. And that is before you factor in the AI workloads that are now generating orders of magnitude more telemetry than traditional web applications.

The Three Cost Traps

1. The Log Ingestion Trap

Logs are the worst offender. A single noisy service — perhaps a chatty third-party integration, a misconfigured debug level left on after a deployment, or an AI agent generating verbose reasoning traces — can dump terabytes in a single day. Most observability platforms charge on both ingestion and indexing, and most teams do not set up exclusion filters until after the first surprise invoice.

We have seen clients receive invoices three to four times their expected amount because a Friday deployment flipped a logging level from warn to debug and nobody noticed until Monday.

2. The High-Watermark Billing Trap

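Datadog bills infrastructure hosts on a high-watermark model: host counts are metered hourly, and the monthly charge is based on the 99th percentile of those hourly counts. Only the top 1% of hours (roughly seven hours in a 30-day month) is discarded, so anything that lasts longer, such as a sustained autoscaling event during a traffic burst, a long load test, or a CI/CD pipeline that keeps spinning up temporary hosts, inflates the bill even when average load is dramatically lower. You pay for close to the peak, not the norm.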
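
To make the percentile effect concrete, here is a minimal sketch. It assumes billing is based on the nearest-rank 99th percentile of hourly host counts over a 30-day month; the exact calculation a vendor applies may differ, and the function name and host numbers are purely illustrative.

```python
import math

def billable_hosts(hourly_counts, percentile=99):
    """Return the host count at the given percentile (nearest-rank method)."""
    if not hourly_counts:
        raise ValueError("need at least one hourly sample")
    ordered = sorted(hourly_counts)
    # Nearest-rank percentile: 1-indexed rank into the sorted samples.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]

HOURS = 720  # hours in a 30-day month (assumption for the example)

# Baseline of 20 hosts with a spike to 200 hosts (illustrative numbers).
short_spike = [20] * (HOURS - 4) + [200] * 4    # 4-hour load test
long_spike = [20] * (HOURS - 12) + [200] * 12   # 12-hour autoscaling event

print(billable_hosts(short_spike))   # 20: the spike falls inside the discarded top 1%
print(billable_hosts(long_spike))    # 200: the spike outlasts the discarded hours
print(sum(long_spike) / HOURS)       # 23.0: average hosts in the same month
```

In the second scenario the average is 23 hosts but the billable figure under this assumption is 200, which is why spikes lasting more than a few hours deserve attention.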

3. The Feature Creep Trap

Modern observability platforms are extraordinary Swiss Army knives. APM, logs, metrics, traces, profiling, real user monitoring, synthetic monitoring, security monitoring, CI visibility — each module has its own pricing dimension. Teams enable features during a trial, forget to disable them, and the costs compound quietly until someone finally audits the invoice.

The OpenTelemetry Inflection Point

The single most important development in the observability space over the past two years is not a new tool — it is a standard. OpenTelemetry (OTel) has matured from a promising CNCF project into a production-ready instrumentation framework that every major observability vendor now supports.

Why does this matter for your bill? Because OpenTelemetry decouples instrumentation from backend. Once your application emits telemetry in OTel format, you can route that data to any compatible backend — Datadog, Grafana Cloud, SigNoz, Uptrace, or your own self-hosted stack — without touching a single line of application code.

This is the vendor abstraction layer that the industry has needed for a decade. It means you are no longer locked into a pricing model just because you instrumented your code with a proprietary SDK three years ago.

A Practical Cost-Reduction Playbook

Step 1: Audit What You Actually Use

Before changing anything, understand where your money goes. Most observability platforms provide usage dashboards, but few teams actually look at them. Start with three questions:

  • Which services generate the most log volume? Typically, 10% of services generate 80% of log data.
  • Which dashboards does anyone actually open? If nobody has viewed a dashboard in 90 days, the data behind it is waste.
  • Which alerts fire, and which get actioned? Alert fatigue is both an operational and a financial problem.
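
The first question can be answered with a few lines of scripting over a usage export. The sketch below assumes you can obtain (service, bytes) pairs from your platform's usage data; the function name, the record shape, and the sample figures are hypothetical.

```python
from collections import Counter

def top_log_producers(records, share=0.8):
    """Return the fewest services (largest first) that cover `share` of total log bytes."""
    totals = Counter()
    for service, num_bytes in records:
        totals[service] += num_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    selected, running = [], 0
    for service, num_bytes in totals.most_common():
        selected.append((service, num_bytes))
        running += num_bytes
        if running >= share * grand_total:
            break
    return selected

# Hypothetical usage export: (service name, ingested log volume in GB).
usage = [
    ("checkout", 120),
    ("search", 40),
    ("payments-webhook", 700),
    ("auth", 60),
    ("checkout", 80),
]

for service, volume in top_log_producers(usage):
    print(f"{service}: {volume} GB")
```

With this sample data, two of the four services account for 90% of the volume, which is the kind of concentration that makes exclusion filters worthwhile.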

Step 2: Implement Tiered Storage

Not all telemetry data is equally valuable. A tiered approach dramatically reduces costs:

  • Hot tier (0–7 days): Full-resolution data in your primary observability tool. This is what you need during active incidents.
  • Warm tier (7–30 days): Sampled data or aggregated metrics. Good enough for trend analysis and post-incident reviews.
  • Cold tier (30–365 days): Compressed logs in object storage (S3, GCS). Searchable when needed, but not indexed in real time.

Teams that implement tiered storage consistently report 40–60% cost reductions on log-related spend alone.
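
The tier boundaries above can be expressed as a small lookup. This is a sketch under stated assumptions: data exactly 7 days old is treated as warm and exactly 30 days old as cold (the ranges above share their endpoints), and anything older than 365 days is labelled "expired", which is a retention choice not specified above.

```python
from collections import Counter

HOT_DAYS = 7      # full resolution in the primary tool
WARM_DAYS = 30    # sampled or aggregated
COLD_DAYS = 365   # compressed in object storage

def storage_tier(age_days):
    """Map the age of a telemetry record (in days) to a storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < WARM_DAYS:
        return "warm"
    if age_days <= COLD_DAYS:
        return "cold"
    return "expired"  # assumption: drop data past the cold window

# Count how many records would land in each tier (ages are illustrative).
ages = [0.5, 3, 7, 12, 29, 30, 200, 400]
print(Counter(storage_tier(a) for a in ages))
```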

Step 3: Adopt the Split-Vendor Model

The pragmatic approach in 2026 is not to abandon your primary observability tool entirely — it is to use it for what it does best and move everything else to cheaper alternatives.

A common pattern we see working well:

  • Keep Datadog for APM and traces — this is where its correlation engine genuinely excels
  • Move logs to Grafana Loki or SigNoz — logs are the biggest line item and the easiest to redirect via OpenTelemetry
  • Move metrics to Prometheus + Grafana — the industry standard for Kubernetes-native monitoring, battle-tested and cost-effective
  • Move error tracking to Sentry — purpose-built, with predictable per-event pricing

With OpenTelemetry as your instrumentation layer, this split is an infrastructure routing decision, not a re-instrumentation project.
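
To illustrate what "a routing decision" means, the sketch below groups telemetry records by destination using a plain lookup table. In a real deployment this mapping would typically live in your telemetry pipeline's configuration rather than in application code; the record shape, the `route` function, and the backend labels are hypothetical and only mirror the pattern listed above.

```python
from collections import defaultdict

# Signal type -> backend label, mirroring the split-vendor pattern above.
ROUTES = {
    "traces": "datadog",
    "logs": "loki",
    "metrics": "prometheus",
    "errors": "sentry",
}

def route(batch, routes=ROUTES, default="datadog"):
    """Group telemetry records by destination backend based on signal type."""
    by_backend = defaultdict(list)
    for record in batch:
        backend = routes.get(record["signal"], default)
        by_backend[backend].append(record)
    return dict(by_backend)

batch = [
    {"signal": "logs", "body": "user logged in"},
    {"signal": "traces", "name": "GET /cart"},
    {"signal": "logs", "body": "cache miss"},
    {"signal": "metrics", "name": "http_requests_total"},
]

for backend, records in route(batch).items():
    print(backend, len(records))
```

Changing vendors then means editing the table, not the services that emit the data.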

Step 4: Set Up Cost Guardrails

Prevention beats remediation. Implement these guardrails before the next surprise invoice:

  • Log level enforcement in CI/CD: Lint your configurations to ensure debug logging never reaches production unless explicitly enabled with a time-boxed expiry.
  • Ingestion budgets: Set per-service daily ingestion limits. Most platforms support this — few teams configure it.
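  • Sampling strategies: Not every request needs a full distributed trace. Keeping 10–20% of traces from healthy services, combined with tail-based sampling that captures 100% of errors and slow requests, gives you the signal without the noise. Make the keep-or-drop decision after the trace completes (for example in the collector), because traces dropped by head-based sampling at the source are gone before anyone knows they contained an error.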
  • Monthly cost alerts: Set alerts at 80% and 100% of your expected monthly spend. Treat cost overruns with the same urgency as a production incident.
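
Two of these guardrails reduce to very small decision rules. The sketch below shows a keep-or-drop rule applied once a trace has completed, and a check for which spend thresholds have been crossed. The function names, the 1,000 ms "slow" cutoff, and the 10% keep rate are assumptions chosen for illustration, not recommendations from any particular platform.

```python
import random

def keep_trace(is_error, duration_ms, slow_ms=1000, healthy_rate=0.10, rng=random.random):
    """Decide whether to keep a completed trace.

    Errors and slow requests are always kept; healthy traces are kept
    with probability `healthy_rate`.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    return rng() < healthy_rate

def spend_alerts(spend_to_date, expected_monthly, thresholds=(0.8, 1.0)):
    """Return the budget thresholds (as fractions) that spend has reached."""
    return [t for t in thresholds if spend_to_date >= t * expected_monthly]

print(keep_trace(is_error=True, duration_ms=40))      # True: errors are always kept
print(keep_trace(is_error=False, duration_ms=2500))   # True: slow requests are always kept
print(spend_alerts(8500, 10000))                      # [0.8]
print(spend_alerts(10200, 10000))                     # [0.8, 1.0]
```

The `rng` parameter exists only so the probabilistic branch can be tested deterministically.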

The AI Observability Multiplier

If your team is running AI workloads — LLM-powered features, coding agents, RAG pipelines — your observability challenge just got harder. AI workloads generate telemetry at a fundamentally different scale: longer request durations, larger payloads, verbose reasoning traces, and unpredictable token consumption.

The temptation is to log everything for debugging. The reality is that LLM prompt-response pairs can be kilobytes each, and at scale, that adds up to terabytes of log data per month. Apply the same tiered approach: sample aggressively in production, log fully in development and staging, and use dedicated LLM observability tools (Langfuse, LangSmith) for AI-specific tracing rather than routing everything through your general-purpose platform.
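
As a rough sketch of "sample aggressively in production, log fully in development and staging", the function below always records payload sizes but only attaches the full prompt and response outside production or for a sampled fraction of production calls. The function name, the record fields, and the 5% rate are hypothetical.

```python
import random

def llm_log_record(prompt, response, environment, sample_rate=0.05, rng=random.random):
    """Build a log record for one LLM call, including full text only when sampled."""
    full = environment != "production" or rng() < sample_rate
    record = {
        "environment": environment,
        "prompt_chars": len(prompt),       # cheap metadata, always recorded
        "response_chars": len(response),
        "full_payload": full,
    }
    if full:
        # Full text is the expensive part; only sampled calls carry it.
        record["prompt"] = prompt
        record["response"] = response
    return record

staging = llm_log_record("Summarise this ticket", "The customer reports...", "staging")
print(staging["full_payload"])   # True: non-production always logs fully
```

Size metadata stays cheap to retain for every call, so volume trends remain visible even when most payloads are dropped.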

When Self-Hosting Makes Sense

For teams processing more than 1 TB of logs per day, self-hosting part of your observability stack can deliver dramatic savings. The open-source ecosystem has matured significantly:

  • SigNoz: A full-stack open-source alternative built on ClickHouse, with APM, logs, and metrics in a single platform
  • Grafana Stack: Loki (logs) + Mimir (metrics) + Tempo (traces) — each purpose-built and horizontally scalable
  • VictoriaMetrics: Drop-in Prometheus replacement with significantly better compression and query performance

The trade-off is operational overhead. You are exchanging a vendor bill for engineering time. For teams under 1 TB per day, a managed split-vendor approach is usually more cost-effective. Above that threshold, the maths starts to favour self-hosting — particularly for logs.

The Bottom Line

Observability is not optional. Flying blind in production is how you turn a minor bug into a major outage. But paying more to monitor your application than to run it is not a sign of engineering maturity — it is a sign that your observability strategy has not kept pace with the tooling available in 2026.

The path forward is straightforward: instrument with OpenTelemetry, audit ruthlessly, tier your storage, split your vendors where the economics demand it, and treat your monitoring budget with the same discipline you apply to your infrastructure budget.

At REPTILEHAUS, we help teams across Dublin and beyond design observability architectures that deliver genuine insight without the runaway costs. Whether you are migrating from a monolithic monitoring platform, implementing OpenTelemetry for the first time, or simply trying to make sense of a Datadog invoice that keeps climbing — get in touch. We have been through this migration with clients at every scale, and we know where the savings are hiding.

📷 Photo by Luke Chesser on Unsplash