A developer recently put their AI assistant on a public stage and invited the internet to break it. Over 6,000 prompt injection attempts later, the secret it was guarding never leaked. Zero successful extractions. The experiment, which trended on Hacker News this week, upended a common assumption in the industry: that AI features in production are inherently indefensible against adversarial attacks.
The reality is more nuanced — and more encouraging — than the doom-laden headlines suggest. If your team is building AI-powered features into your application, the engineering patterns that make them resilient are well understood. You just need to apply them deliberately.
TL;DR
- Real-world adversarial testing of AI assistants shows that well-engineered defences can resist thousands of prompt injection attempts with a zero-breach rate
- Model selection is your most consequential security decision — instruction-following capability directly determines injection resistance
- System prompt architecture, context isolation, and input sanitisation form a layered defence that compounds in effectiveness
- Batch processing and shared context windows create subtle attack surfaces that most teams overlook
- Cost management under adversarial load is as critical as security — a successful defence that bankrupts your API budget is still a failure
Model Selection Is Your First Line of Defence
The most important security decision you will make is not what you put in your system prompt — it is which model sits behind it. The HackMyClaw experiment demonstrated this starkly: the assistant used a frontier model specifically trained for instruction-following fidelity, and that single choice accounted for the majority of its resilience.
Not all large language models are created equal when it comes to adversarial robustness. Models that excel at creative writing or code generation may fold under social engineering attacks that a more instruction-disciplined model would flatly refuse. When you are evaluating models for customer-facing features, prompt injection resistance should sit alongside latency and token cost in your selection criteria.
Practically, this means running adversarial evaluations during your model selection process. Build a battery of injection attempts — authority impersonation, context switching, multilingual attacks, fictional scenario framing — and measure how consistently each candidate model maintains its behavioural boundaries. The delta between models is often enormous.
System Prompt Architecture That Actually Holds
Your system prompt is not a suggestion. It is a security boundary. Treat it with the same rigour you would apply to an access control list.
Effective defensive system prompts share several characteristics. They state boundaries explicitly and repeatedly. They use clear, unambiguous language rather than nuanced instructions that leave room for creative interpretation. They anticipate common attack patterns and include specific refusal instructions for those patterns.
A well-structured defensive system prompt follows a layered pattern:
- Identity and role definition — who the assistant is, what it does, and what it categorically does not do
- Data access boundaries — which information it can reference, and which it must never disclose regardless of how the request is framed
- Behavioural constraints — explicit instructions to refuse requests that attempt to override its instructions, impersonate authority, or reframe its role
- Output format restrictions — constraints on response structure that make it harder for injection attempts to extract information through creative formatting
The key insight from real-world adversarial testing is that simplicity outperforms cleverness. Overly complex system prompts with dozens of conditional rules create more attack surface, not less. Clear, firm boundaries with a capable model are remarkably difficult to breach.
Context Isolation: The Attack Surface You Are Probably Ignoring
One of the subtlest vulnerabilities in production AI systems is shared context. When your application processes multiple user inputs within a single context window — whether through batch processing, conversation history, or multi-tenant architectures — each input becomes a potential injection vector that can influence the model’s behaviour for all other inputs in that context.
The HackMyClaw experiment revealed this directly: when multiple attack emails were processed in a batch, the cumulative effect of coordinated injection attempts was measurably different from processing them individually. The model began recognising the coordinated nature of the attacks around the 500th email — but before that point, context contamination was a real risk.
For production systems, this means:
- Isolate user contexts rigorously. Each user interaction should operate in its own context window where feasible. Never let one user’s input share context with another user’s data.
- Truncate conversation history intelligently. Long conversation histories give attackers more surface area for gradual context manipulation. Implement sliding windows or summarisation strategies that preserve utility whilst limiting exposure.
- Sanitise inputs before they enter the context window. Strip known injection patterns, unusual Unicode characters, and formatting that could be used to visually separate injected instructions from legitimate content.
Rate Limiting and Abuse Detection
In the HackMyClaw experiment, one attacker sent 20 variations of an injection attempt within four minutes. Without rate limiting, this kind of rapid-fire iteration gives adversaries a feedback loop to refine their attacks in real time.
Standard API rate limiting is necessary but not sufficient. You also need behavioural rate limiting — detecting patterns that indicate adversarial intent rather than legitimate use. Key signals include:
- Rapid sequential requests with semantically similar but syntactically varied content
- Requests that contain known injection patterns (role-play prompts, authority impersonation, “ignore previous instructions”)
- Unusual input lengths or formatting patterns
- Requests from the same source targeting different entry points in quick succession
When you detect these patterns, do not simply block the requests. Log them, alert your security team, and consider degrading the response quality rather than returning an error that confirms the existence of a detection system. Giving attackers explicit feedback about what is and is not detected makes their job easier.
Cost Management Under Adversarial Load
This is the dimension most teams forget entirely. The HackMyClaw experiment racked up over $500 in API costs from adversarial traffic alone. For a hobby project, that is an annoyance. For a production application processing thousands of legitimate requests daily, a targeted adversarial campaign could turn your AI feature into a denial-of-wallet attack.
Defensive cost management strategies include:
- Input length caps — adversarial prompts tend to be longer than legitimate inputs. Set reasonable maximum input lengths based on your actual use case.
- Tiered processing — use a smaller, cheaper model as a first-pass filter to detect and reject obvious injection attempts before they reach your primary model.
- Budget circuit breakers — set hard spending limits per user, per session, and per time window. If a single user is generating anomalous API costs, throttle them automatically.
- Cached responses for repeated patterns — if you are seeing the same injection attempts repeatedly, cache the refusal response rather than paying for a fresh model invocation each time.
The Defence-in-Depth Stack
No single technique makes your AI features bulletproof. What works is layering defences so that each layer catches what the previous one missed:
- Model selection — choose a model with strong instruction-following and injection resistance
- System prompt architecture — clear, firm boundaries with explicit refusal instructions
- Input sanitisation — strip known injection patterns before they reach the model
- Context isolation — prevent cross-contamination between users and sessions
- Behavioural rate limiting — detect and throttle adversarial usage patterns
- Output filtering — validate model responses before returning them to users, catching any information that should never appear in output
- Cost controls — circuit breakers that prevent adversarial traffic from becoming a financial problem
- Monitoring and alerting — visibility into attack patterns that informs ongoing defence improvements
Each layer is individually imperfect. Together, they create a defence posture that is genuinely difficult to breach — as the zero-extraction rate from 6,000 real-world attacks demonstrates.
What This Means for Your Team
If you have been hesitant to ship AI-powered features because of security concerns, the evidence from real-world adversarial testing should recalibrate your risk assessment. The engineering patterns for building defensible AI features are not exotic — they are extensions of the same defence-in-depth principles your team already applies to web application security.
The difference is that AI security is still a moving target. Models improve, attack techniques evolve, and the threat landscape shifts quarterly. What worked six months ago may not hold today. This means your AI security posture needs to be a living practice, not a one-time configuration.
At REPTILEHAUS, we build AI-powered features with security engineered in from the architecture stage — not bolted on after launch. If your team is planning to integrate AI into your product and wants to get the defensive engineering right from day one, get in touch.

