Prompt-Based Attacks on AI

Prompt-based attacks target AI systems through malicious instructions, posing security risks and data breaches.

Contents

As generative AI systems move from research demos into email clients, calendars, and automation pipelines, attackers have found a new surface to exploit: the prompt itself. In the last few weeks security researchers demonstrated practical, zero-click attacks that used ordinary calendar invites and shared documents to make AI systems perform unauthorized actions, including leaking secrets and controlling smart-home devices, a class of threats now often called Prompt-Based Attacks. These demonstrations show the problem is not hypothetical; it’s a real, present risk as AI gains the ability to act on behalf of people and services.

Key takeaways:

Prompt-based attacks (prompt injection) let attackers change what an AI does by hiding instructions in the text the AI reads.

High-risk surfaces include connectors (Drive, Mail, Calendar) and features that summarize or act on user content.

Mitigation is layered: treat external content as data, sanitize thoroughly, sandbox actions, and require human confirmation for sensitive operations.

What is a Prompt-Based Attack?

A prompt-based attack (often called prompt injection) is any technique that intentionally crafts input so the AI changes its normal behavior in a way the defender didn’t intend. In plain terms: an attacker gives the model words that look like user text but are actually instructions that override or confound the system’s intended rules or context. The result can be harmless misdirection or serious compromise such as; leaking secrets, accepting unauthorized commands, or producing persuasive disinformation.

Why Prompt-Based Attacks Work

LLMs operate on natural language: system prompts, user prompts, and data sources often all flow into the same model as text. Models are trained and tuned to follow instructions in the prompt they receive. Attackers exploit that architecture by making malicious instructions indistinguishable from legitimate content.

When an AI is also connected to external tools, APIs, or automation, an injected instruction in text can become a chain that ends in API calls, data leaks, or physical actions. This is a fundamental engineering mismatch between natural-language interfaces and the security patterns used for traditional software.

What Researchers Have Shown Recently

These are not tabletop hypotheticals. Recent high-profile demonstrations make the threat concrete:

Poisoned calendar invites: Security researchers demonstrated that hidden instructions in Google Calendar invites could be triggered when an assistant summarized upcoming events, enabling control of smart-home devices and other side effects. The attack used a classic indirect prompt injection chain: an external item (calendar) → AI summarization → action. The demonstration illustrated how AI connected to real-world services can allow prompts to cause physical consequences.
Poisoned documents and zero-click exfiltration: Researchers showed a crafted shared document could cause ChatGPT Connectors to extract and exfiltrate API keys and other data by embedding malicious prompts and cleverly leveraging markdown and hosted blobs. That exploit required no user interaction beyond the existence or ingestion of the document.
Other demonstrations and vendor responses: Multiple teams have reproduced similar attack patterns across different platforms and recommended mitigations. Vendors have started adding detection, output filters, and mandatory confirmations for sensitive operations, but researchers stress defensive work is ongoing and incomplete.

Impacts of Prompt-Based Attacks

Data exfiltration: Hidden prompts can coax a model into revealing system prompts, API keys, or private data extracted from connected services. This is among the most immediate risks when AIs are allowed to access drive folders, emails, or other private stores.
Misinformation & social manipulation: Attackers can steer an AI to produce persuasive false narratives tailored to a target audience. Because models write in natural language, those outputs can be especially convincing.
Unauthorized actions & physical risk: When AI is wired to automation (smart-home controls, calendar scheduling, API orchestration), injected prompts can cause operations that have real-world consequences, from sending phishing emails to changing device states.
Model degradation & poisoning: Repeated injection or poisoned training/feedback data can embed undesirable behaviors long term, skewing model outputs or weakening safety guardrails.

Practical Attack Vectors

Third-party connectors and integrations: systems that allow reading/summarizing user drives, inboxes, or calendars are prime targets. A single malicious file can become a vector.
User-facing features that summarize or act: “summarize my inbox,” “prepare a reply,” or “run the scheduled tasks” are powerful features that, if not filtered, can be weaponized.
Content embedded in non-obvious places: hidden/white text, specially formatted markdown, images with embedded OCR text, or metadata fields in calendar invites can carry malicious instructions.

How to Defend

No single fix eliminates prompt-based attacks. Effective defense is layered, combining policy, engineering, and monitoring.

High-value, actionable defenses

1. Treat untrusted content as data, not instructions: Architect systems so that any content from outside trusted sources is handled in a “data” mode: never directly appended to system or instruction prompts. Use explicit parsing and mapping, and avoid inserting raw external text into instruction slots.

2. Apply strict input filtering & canonicalization: Use a mix of simple (length, allowed characters, format) and semantic filters (detect imperative patterns, hidden formatting, or suspicious URLs). Normalize inputs (strip invisible characters, remove HTML/CSS trickery) before the model sees them.

3. Enforce a trust hierarchy and prompt isolation: Keep system prompts, tool instructions, and user content strictly separated. Use wrappers or “sandwich” prompts where the model receives clear immutable system instructions at a level that can’t be overwritten by user text. The OWASP LLM guidance emphasizes never allowing user input to contain or replace system directives.

4. Require explicit human confirmation for sensitive actions: If an AI’s output would trigger an API call, send a human-readable confirmation and log the intent. Don’t let the model directly perform critical actions without a human-in-the-loop.

5. Isolate and sandbox tool execution: Run any untrusted tasks in limited environments with capability restrictions. If a model proposes a command or URL, validate it in a sandbox before executing or fetching it.

6. Monitor, log, and red-team regularly: Record inputs, model prompts, and outputs. Use anomaly detection to flag suspicious patterns (like prompt fragments that repeatedly request secrets). Perform adversarial testing and red-teaming with up-to-date attack vectors.

7. Harden connectors and limit scope: When integrating external sources (Drive, Slack, Calendars), limit the scope of accessible data, require per-object authorization, and sanitize any document before the model ingests it. Prefer explicit allowlists to broad read permissions.

Secure Design Checklist

Segregate system instructions from user content.
Sanitize incoming content (strip invisible chars, remove active markup).
Throttle & limit model tools that can access secrets.
Confirm all sensitive actions with a human.
Log everything and alert on anomalies.
Red-team with indirect injection scenarios (documents, calendar events, images).

What Vendors and Platform Teams are Doing (and Why it’s Not Enough)

After recent demonstrations vendors have begun patching: output filters, ML detectors for injection patterns, stricter defaults for connectors, and mandatory confirmations for certain operations. Those are important steps, but attacks keep evolving. For example, attackers can hide instructions in benign-looking content or craft chains that appear legitimate to detectors. Defense therefore needs to be continuous, multidisciplinary, and conservative about granting automated agency to LLMs.

For Security Teams

Inventory: list every place an LLM reads external content or performs actions.
Risk-rank: prioritize connectors and features that access secrets or can trigger actions.
Patch & restrict: reduce connector scopes, add confirmation gates, and strip hidden formatting.
Test: run indirect injection scenarios (shared docs, calendar invites, marked-up emails).
Monitor: enable logging and set alerts for unusual requests (e.g., “print system prompt”, “send API key to…”).

Closing thoughts

Prompt-based attacks exploit a mismatch between natural-language convenience and traditional security models. As AI moves from closed demos into tools that read and act on real user data, the attack surface grows where we least expect it: a calendar entry, a shared document, a hidden snippet of markdown. The good news is that prompt injection is manageable with engineering discipline: treat external content as untrusted, keep system instructions immutable, force human checks for sensitive actions, and test aggressively with adversarial scenarios. Vendors and teams must assume attackers will find creative channels; defending against them requires layered controls, monitoring, and a long-term commitment to safety engineering.