For the last twenty years, web architecture has followed a simple, elegant contract: keep your servers stateless, put your state in the database, and let a load balancer distribute traffic evenly. This model powered everything from early Rails apps to global-scale SaaS platforms. It worked because HTTP requests were fast, predictable, and cheap.
Then large language models arrived — and quietly broke all three assumptions.
TL;DR
- Traditional stateless web architecture assumes fast, cheap, predictable requests — LLM workloads violate all three
- Long-running agent tasks, stateful conversation context, and bidirectional streaming expose a missing routing primitive in most backend stacks
- Using your database as a message bus (polling) is the default workaround, and it scales terribly
- Durable execution frameworks, pub/sub routing, and connection-resilient streaming patterns are the emerging architectural answers
- Teams building AI features on traditional backends will hit these walls — better to design for them now
The Three Assumptions That No Longer Hold
The classic web backend was built around three core assumptions that held true for two decades. LLM workloads break every one of them.
1. Requests Are Fast
A typical API call returns in 50–200 milliseconds. Even complex database queries rarely exceed a second. Load balancers, connection pools, and timeout settings are all calibrated around this expectation.
An LLM inference call? That can take 5–30 seconds for a single completion. An agentic workflow — where the model reasons, calls tools, evaluates results, and iterates — can run for minutes. Some production agent tasks take ten minutes or more. Your Nginx timeout defaults were not designed for this.
This is not just a performance problem. It is a fundamental mismatch between your infrastructure’s assumptions and your workload’s reality. Connection pools exhaust. Worker threads block. Autoscalers misread load signals because a server handling three LLM requests looks idle by CPU metrics but is completely saturated.
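To make the mismatch concrete, here is a sketch of the conventional handler shape most backends use today; the endpoint and the `completeWithLLM` stand-in are illustrative, not taken from any real codebase.

```typescript
// Illustration only: an LLM call handled like any other API call.
// `completeWithLLM` stands in for whichever inference client you use.
import express from 'express';

const app = express();
app.use(express.json());

async function completeWithLLM(prompt: string): Promise<string> {
  // In reality this is a 5-30 second network call to an inference API.
  return `response to: ${prompt}`;
}

app.post('/chat', async (req, res) => {
  // The client connection stays open for the full duration, which is longer
  // than many proxy and load-balancer timeout defaults, and the request counts
  // against concurrency limits while CPU-based autoscaling sees an idle server.
  const completion = await completeWithLLM(req.body.prompt);
  res.json({ completion });
});

app.listen(3000);
```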
2. Compute Is Stateless
The stateless server model works brilliantly when every request is independent. Any server can handle any request because the database holds all the state.
But an AI agent maintaining a multi-turn conversation carries context. It remembers what the user said three turns ago, what tools it called, what results came back. This context is not a database row — it is an in-flight computational state that lives in memory on a specific server process. If that process dies or the connection drops, you cannot simply replay the request to another server. The context, the partial results, and the reasoning chain are lost.
Some teams serialise conversation state to a database between turns. This works for simple chatbots but falls apart for agentic workflows where the model is mid-execution — halfway through a tool call chain, holding intermediate results, waiting on an external API. Serialising that state is like trying to checkpoint a running program. It is technically possible but architecturally painful.
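For the simple-chatbot case, serialising between turns looks roughly like the sketch below, where `db` and `completeWithLLM` are hypothetical stand-ins; the trouble starts when the state to persist is not a list of messages but a half-finished tool-call chain.

```typescript
// A sketch of per-turn state serialisation for a plain chatbot.
// `db` and `completeWithLLM` are hypothetical stand-ins, not a real API.
type Message = { role: 'user' | 'assistant'; content: string };

declare const db: {
  loadMessages(conversationId: string): Promise<Message[]>;
  saveMessages(conversationId: string, messages: Message[]): Promise<void>;
};
declare function completeWithLLM(messages: Message[]): Promise<string>;

async function handleTurn(conversationId: string, userInput: string): Promise<string> {
  // Reload the whole conversation from the database: any server can serve this turn.
  const history = await db.loadMessages(conversationId);

  const reply = await completeWithLLM([...history, { role: 'user', content: userInput }]);

  // Persist both sides of the turn so the next request can start from scratch.
  await db.saveMessages(conversationId, [
    { role: 'user', content: userInput },
    { role: 'assistant', content: reply },
  ]);
  return reply;
}
```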
3. Communication Is Request/Response
HTTP’s request/response model assumes the client asks and the server answers. One direction, one exchange, done.
LLM applications need bidirectional, streaming communication. The user sends a prompt. The model streams tokens back over seconds. The user sees the response forming in real time. Midway through, the user might want to cancel, redirect, or provide additional input. The model might need to ask a clarifying question or request approval before executing a tool.
This is not request/response. It is a conversation — a sustained, bidirectional channel between a specific client and a specific server-side process. And your load balancer has no concept of routing a follow-up message to “whichever server is currently running workflow X for user Y.”
The Database-as-Message-Bus Anti-Pattern
When teams first hit these walls, the instinct is to reach for what they know. The result is almost always the same: polling the database.
The client submits a request. The server kicks off an LLM task and writes status updates to a database table. The client polls an endpoint every few seconds, reading from that same table. It works. Barely.
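Condensed into code, the workaround usually looks something like this; the endpoints, the task-table helpers, and the agent function are illustrative stand-ins.

```typescript
// The database-as-message-bus pattern, condensed. Table helpers, endpoints and
// the agent function are illustrative stand-ins.
import express from 'express';
import { randomUUID } from 'node:crypto';

declare const db: {
  createTask(id: string, prompt: string): Promise<void>;
  updateTask(id: string, fields: { status: string; output?: string }): Promise<void>;
  getTask(id: string): Promise<{ status: string; output?: string }>;
};
declare function runAgent(prompt: string): Promise<string>;

const app = express();
app.use(express.json());

app.post('/tasks', async (req, res) => {
  const id = randomUUID();
  await db.createTask(id, req.body.prompt);
  // Fire and forget: the task reports progress by writing to the same table.
  runAgent(req.body.prompt)
    .then((output) => db.updateTask(id, { status: 'done', output }))
    .catch(() => db.updateTask(id, { status: 'failed' }));
  res.json({ id });
});

// Every connected client calls this every few seconds, whether or not
// anything has changed.
app.get('/tasks/:id', async (req, res) => {
  res.json(await db.getTask(req.params.id));
});

app.listen(3000);
```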
The problems compound quickly. Polling latency means the user experience feels sluggish — they are always seconds behind the actual state. Database load scales linearly with connected clients, not with actual work being done. You are burning reads on “has anything changed?” queries that almost always return “no.” And the pattern completely breaks down when you need real-time streaming of partial results, because you cannot poll fast enough without destroying your database.
We have seen this pattern repeatedly in client projects: a team builds a promising AI feature, ships it as a prototype, and then discovers that their entire backend architecture fights them when they try to scale it.
What the New Architecture Looks Like
The good news is that the industry is converging on solutions. The bad news is that they require rethinking some deeply embedded assumptions.
Durable Execution for Long-Running Tasks
Frameworks like Temporal, Restate, and Inngest solve the stateful, long-running process problem by treating workflows as first-class, durable entities. If a server crashes mid-execution, the workflow resumes from its last checkpoint on another server. This is not new technology — it borrows from decades of workflow orchestration — but it is newly critical for AI workloads.
The key insight: your agent task is not an HTTP request. It is a workflow. Model it as one.
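As a sketch of what that modelling looks like, here is an illustrative agent loop written against Temporal's TypeScript SDK; the activity names, payload shapes, and termination logic are invented for the example.

```typescript
import { proxyActivities } from '@temporalio/workflow';

// Hypothetical activities: plain async functions registered on a Temporal worker.
interface AgentActivities {
  callModel(context: string): Promise<{ finished: boolean; answer: string; toolCall: string }>;
  runTool(toolCall: string): Promise<string>;
}

const { callModel, runTool } = proxyActivities<AgentActivities>({
  startToCloseTimeout: '5 minutes', // LLM calls get explicit, generous timeouts
  retry: { maximumAttempts: 3 },
});

// The agent task is a workflow, not a request: every awaited activity is
// checkpointed, so a crashed worker resumes from history on another machine.
export async function agentTask(prompt: string): Promise<string> {
  let context = prompt;
  for (let step = 0; step < 10; step++) {
    const decision = await callModel(context);
    if (decision.finished) return decision.answer;
    context += `\n${await runTool(decision.toolCall)}`;
  }
  return context;
}
```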
Addressable Channels for Routing
The missing primitive in most architectures is a durable, named channel — a pub/sub topic or WebSocket room that both the client and the server-side process can connect to by name. When the client reconnects after a network blip, it does not need to know which server is running its workflow. It subscribes to the channel by workflow ID and picks up where it left off.
Technologies like Redis Streams, NATS JetStream, and cloud-native pub/sub services provide this. Server-Sent Events (SSE) over HTTP/3 offer a lighter-weight option for streaming results without the overhead of WebSockets. The pattern works because it decouples the routing question (“where is my workflow?”) from the infrastructure question (“which server is handling it?”).
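Sketched with Redis Streams via ioredis below; the stream-naming scheme and field layout are assumptions rather than a prescribed convention.

```typescript
// An addressable channel sketched on Redis Streams with ioredis.
// The `workflow:<id>:output` naming and the `token` field are assumptions.
import Redis from 'ioredis';

const redis = new Redis();

// Whichever server runs the workflow appends to a stream named after the
// workflow, not after the server.
export async function publishToken(workflowId: string, token: string): Promise<void> {
  await redis.xadd(`workflow:${workflowId}:output`, '*', 'token', token);
}

// Any server can serve a subscriber, because it reads by name. `lastId` lets a
// reconnecting client resume exactly where it left off.
export async function readTokens(workflowId: string, lastId = '0-0') {
  const results = await redis.xread(
    'BLOCK', 5000, 'STREAMS', `workflow:${workflowId}:output`, lastId
  );
  // Shape: [[streamName, [[entryId, [field, value, ...]], ...]]] or null on timeout.
  return results?.[0]?.[1] ?? [];
}
```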
Connection-Resilient Streaming
Mobile networks drop. Browser tabs suspend. WiFi switches. Your streaming architecture needs to handle all of this gracefully. The answer is resumable streams — where the client can reconnect and say “I last received token 847, send me everything after that.”
This requires your streaming layer to buffer recent output and support cursor-based resumption. It is more complex than a naive WebSocket pipe, but it is the difference between a production-grade AI application and a demo that breaks on every underground train ride.
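A minimal sketch of cursor-based resumption over SSE: the browser's EventSource resends the last `id:` it saw in a `Last-Event-ID` header on reconnect, so the server replays the gap from a buffer before resuming live output. The buffer helpers here are hypothetical and might sit on top of the Redis Stream above.

```typescript
import express from 'express';

// Hypothetical buffer helpers: in practice these might read the Redis Stream
// from the previous sketch.
declare function getBufferedTokens(
  workflowId: string,
  afterIndex: number
): Promise<Array<{ index: number; token: string }>>;
declare function subscribeToNewTokens(
  workflowId: string,
  onToken: (t: { index: number; token: string }) => void
): () => void;

const app = express();

app.get('/workflows/:id/stream', async (req, res) => {
  res.set({ 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' });
  res.flushHeaders();

  // EventSource automatically resends the last `id:` it received.
  const lastSeen = Number(req.get('Last-Event-ID') ?? -1);

  // Replay whatever the client missed while disconnected...
  for (const { index, token } of await getBufferedTokens(req.params.id, lastSeen)) {
    res.write(`id: ${index}\ndata: ${JSON.stringify(token)}\n\n`);
  }

  // ...then stream live tokens, tagging each with its cursor.
  const unsubscribe = subscribeToNewTokens(req.params.id, ({ index, token }) => {
    res.write(`id: ${index}\ndata: ${JSON.stringify(token)}\n\n`);
  });
  req.on('close', unsubscribe);
});

app.listen(3000);
```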
GPU-Aware Load Balancing
Traditional load balancers distribute requests round-robin or by least connections. Neither metric makes sense for LLM inference, where a single request can monopolise a GPU for seconds. You need load balancers that understand inference queue depth, available VRAM, and estimated completion time — metrics that most off-the-shelf load balancers do not expose.
Teams self-hosting models are building custom routing layers. Teams using inference APIs (OpenAI, Anthropic, etc.) are building LLM routers that distribute across multiple providers based on cost, latency, and availability. Either way, the load balancing layer needs to get smarter.
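For the inference-API case, a toy version of that routing decision might look like the sketch below; the metrics and weights are illustrative, and a production router would also account for rate limits and model capabilities.

```typescript
// A toy multi-provider router. Metrics collection and scoring weights are
// illustrative; they are not taken from any real routing layer.
interface ProviderStats {
  name: string;
  costPerMillionTokens: number; // USD, configured
  p95LatencyMs: number;         // rolling measurement
  recentErrorRate: number;      // 0..1, rolling measurement
  queueDepth: number;           // requests currently in flight with this provider
}

function pickProvider(providers: ProviderStats[]): ProviderStats {
  const score = (p: ProviderStats) =>
    p.p95LatencyMs / 1000 +        // prefer fast providers
    p.queueDepth * 0.5 +           // avoid the ones we have already loaded up
    p.recentErrorRate * 10 +       // back off hard from failing providers
    p.costPerMillionTokens / 10;   // nudge toward cheaper options

  return providers.reduce((best, p) => (score(p) < score(best) ? p : best));
}
```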
Practical Steps for Your Team
If you are adding AI features to an existing application — and most teams are — here is where to start:
- Separate your AI workload from your web workload. Do not run LLM tasks on the same worker pool that serves your API. Different timeout profiles, different scaling characteristics, different failure modes.
- Adopt a task queue or workflow engine for anything that takes more than a few seconds. Temporal and Inngest are mature choices. Even a well-configured BullMQ or Celery setup is better than blocking HTTP workers (a minimal sketch follows this list).
- Implement SSE or WebSocket channels with resumption support for streaming results. Your users expect to see tokens appear in real time — polling will not cut it.
- Design for disconnection. Assume the client will lose connection mid-stream. Buffer output. Support cursor-based replay. Make reconnection seamless.
- Monitor differently. Request duration percentiles, GPU utilisation, inference queue depth, and token throughput matter more than traditional p99 latency for your AI endpoints.
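For the queue-based separation mentioned above, a minimal BullMQ sketch might look like this; the queue name, connection, and concurrency are illustrative, and the worker runs in its own process so a ten-minute agent task never occupies an API worker.

```typescript
// A sketch of separating LLM work from web work with BullMQ. The queue name,
// Redis connection, and job payload shape are illustrative.
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// In the web process: enqueue and return immediately.
export const llmQueue = new Queue('llm-tasks', { connection });

export async function enqueueAgentTask(workflowId: string, prompt: string) {
  await llmQueue.add('agent-task', { workflowId, prompt }, { attempts: 3 });
}

// In a separate worker process, sized and scaled independently of the API.
declare function runAgent(prompt: string): Promise<string>;

new Worker(
  'llm-tasks',
  async (job) => {
    // Long-running work lives here, with its own timeout and retry policy.
    return runAgent(job.data.prompt);
  },
  { connection, concurrency: 4 }
);
```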
This Is an Architecture Problem, Not a Model Problem
The industry conversation around AI in production tends to focus on model selection, prompt engineering, and RAG pipelines. These matter. But the teams we work with at REPTILEHAUS are increasingly hitting a different wall entirely: their backend architecture was not designed for the workloads they are now asking it to handle.
The twenty-year-old contract — stateless servers, database state, request/response — served us extraordinarily well. But LLM workloads demand long-running processes, stateful computation, and bidirectional streaming. Pretending otherwise leads to fragile, expensive workarounds that collapse under real traffic.
The architectural patterns to solve this exist today. The question is whether your team recognises the problem before your users do.
Building AI features and hitting infrastructure walls? Get in touch — our team specialises in designing backend architectures that handle AI workloads at scale.
📷 Photo by Taylor Vick on Unsplash



