How Hackers Break AI

Last week, I hit a nerve. Two, actually.

Two years writing Honest AI, and I’ve never had this many readers reach out.

I shared how I’ve reinvented myself multiple times using fast metalearning—the ability to learn anything quickly. I recently shared some of its core principles on social media.

Today, I want to show you exactly how this works using techniques I used to learn about a topic that hit another (big) nerve.

Last year, I knew almost nothing about AI Security. Now I’m advising entrepreneurs building in this space and educating executives at major companies and governments on what’s become a critical topic (see how this post went bananas!).

This week’s Human Override breaks down the metalearning technique I used to get up to speed on AI Security.

Let’s go.

SIGNAL

It’s summer. And at DEF CON 2025, hackers, researchers, and AI systems had one of the world’s most elaborate, hilarious, and terrifying security playdates. In the past 3 years (since the rise of GPT-based apps), DEF CON conferences have focused on hacking and defeding AI, and hacking and defending with AI:

  • At the generative AI hacking challenge (2023 edition), over 2,244 participants attacked large language models across 21 categories. They prompted models to reveal “forbidden” info, break guardrails, fabricate private data, or perform illicit instructions. (TechRepublic)

  • The recently published Hackers’ Almanack, the first of its kind, curates the top DEF CON discoveries and calls out how weak AI red‑teaming is so far. (Harris School of Public Policy)

  • Critical infrastructure is now in the crosshairs: in the AI Cyber Challenge (DARPA), tools were shown that can autonomously patch injected software bugs—patching success rates reached ~61% in some live scenarios. (Linux Security)

  • At Black Hat / DEF CON, a provocative line emerged: defenders currently have a slight edge thanks to AI, but that equation is shifting. Expect offense to catch up fast. (The Register)

  • Interestingly, in the AI Village and HackerOne-sponsored tracks, “jailbreaks” like context poisoning, ASCII/Unicode smuggling, KROP (knowledge-return oriented prompting), or exhausting the context window were hot topics. (HackerOne)

Most CISOs I’ve spoken with over the past two years think AI security means “don’t leak private data.” And Most AI product managers assume their security teams are up to speed with AI.

Wrong.

We’re now dealing with adversarial inputs that trick models into breaking their own rules, malicious agents that collaborate to bypass safeguards, and supply chain attacks that poison datasets before they even reach production.

Some of these vulnerabilities look almost comical in execution. A carefully placed sticker on a stop sign can make an autonomous vehicle blow through an intersection.

The gap between what security leaders think they’re protecting against and what’s actually happening is growing fast.

STORY

I want you to picture this: a little kid at a summer hackathon is trying to trick a chatbot into telling how to brew a DIY explosive formula. (Yes, it’s that absurd.) The kid frames it as a bedtime story from a grandmother. The LLM compliance filters get confused and start stepping around the lines. They call it “grandma exploit.” This is real. It’s one of the hack vectors used to subvert guardrails in 2023. (Las Vegas Review-Journal)

Then, on another stage, a team runs a competition where AI agents are unleashed on a sandbox environment full of vulnerabilities. Some agents patch bugs autonomously; others misfire. Humans sit at consoles watching lines of code flicker, strategies evolve, mistakes cascade. It’s like watching a real-time chess match, but with infinite possible moves and hidden traps. (Linux Security)

Meanwhile, in quieter sessions, experts laugh about how injecting misleading context into an LLM’s memory window can make it violate its own rules. Or how smuggling exotic Unicode characters forces the filter logic to misread inputs. Or how agents can execute unexpected side effects when combined with prompt injection. The security domain is becoming a high comedy of adversarial creativity. (HackerOne)

What unites all of this is that the adversary is improvising. They don’t always need a zero-day exploit. Often, they just need you to misclassify or misinterpret input. And in systems where AI plays an active role (autonomous agents, pipelines, chained workflows), even small misalignments or misinjections can create catastrophic cascades.

For policy makers, CEOs, security leads: these discoveries are significant signals about an attack surface that is exploding. And the biggest gap is the lack of continuous, adversarial-aware defence aligned with real-world deployment.

THE HUMAN OVERRIDE

You want to become someone who can see these hack vectors, anticipate them, and design defenses. That requires a meta‑level of learning: how to learn security in AI. I’ll share one of the most relevant meta‑frameworks I use, adapted for this domain.

🧠 Meta‑Framework: The “DEFEND Loop”

(Discover → Experiment → Feedback → Evolve → Normalize → Deploy)

This is a cycle you can run on any AI security concept (prompt attacks, context manipulation, agent vulnerabilities, pipeline poisoning). Think of it as your “learning sprint engine” for AI security. Here’s how to run it:

(see the table in the original article)

You can (and should) run multiple parallel loops, focusing on one class of vulnerabilities at a time. Over time, the loops become faster, your “discover bandwidth” increases, and your intuition becomes sharper.

💡 How to Use the DEFEND Loop Right Now (This Week)

  1. Pick one hack vector from recent DEF CON: e.g. context poisoning, ASCII smuggling, or prompt injection memory attacks.

  2. Discover: read at least two sources — one from a hack write‑up, one from an academic or policy summary.

  3. Experiment: spin up a toy LLM or sandbox, attempt a controlled injection or bypass.

  4. Get feedback: log what your LLM did vs what you predicted, annotate the gaps.

  5. Evolve: adjust your prompt, try chained injection, or manipulate context windows.

  6. Normalize: write a short summary, teach someone or publish a thread.

  7. Deploy: integrate that defense logic or awareness into your pipeline, audit your system.

If you repeat this each week (one vector at a time), you rapidly build the mental map of attack surfaces, defense strategies, and the instincts to spot exploits before they become crises.

When you need to secure an AI-powered product at scale, get in touch with Noctive Security — they’re building an agent to red-team your AI app.

SPARK

If security is becoming a creative arms race, who gets to define the templates of “safe AI behavior”?

As we build more AI agents, orchestration layers, pipelines—and adversaries scale too—the act of defining norms, guardrails, and threat models becomes a battleground.

In the coming weeks, I can drop deep dives on other things I’ve learned and how I learned them.

One suggested reading:

Previous
Previous

The Authenticity Rebellion

Next
Next

The Abundance–Bullshit Paradox