Securing AI Systems: AI Threat Knowledge Bases

Artificial Intelligence (AI) systems have recently gained significantly greater visibility and use by the general public. In the next five years, the AI market is expected to grow by 368%, reaching more than $400 billion by 2027. Numerous industries are exploring new ways to monetize AI in the form of AI-as-a-Service (AIaaS), providing off-the-shelf solutions that make AI more accessible with minimal upfront resources.

There is an eagerness to create and distribute AI systems to a broad customer base in an effort to increase engagement, provide recommendations, deliver personalized support, gain deeper consumer insights, and create more effective workflows. But the drive to democratize these systems comes with the need to secure them against attacks. Given paradigms on how AI models are exposed, shared, trained, and deployed, it is difficult to know what the specific threats are, how attackers will perform these attacks, what vulnerabilities could be exploited, and how to mitigate attacks.

Fortunately, efforts are underway to catalog adversarial tactics, techniques, and mitigation strategies. Building off of their work with the ATT&CK framework, MITRE has published the ATLAS matrix to document the threat landscape, organizing case studies and mitigations by adversarial tactics and techniques.

In a similar effort, Open Worldwide Application Security Project (OWASP) has documented their top 10 list of potential vulnerabilities to Large Language Models (LLMs). While a few of the listed vulnerabilities are specific to LLMs, the remainder can be extrapolated to AI models at large.

In this post, the third in a series about AI security, we will look at:

  1. The goals attackers attempt to accomplish when attacking AI
  2. The tactics used to achieve their goals
  3. Specific techniques used to get the job done

We will also discuss knowledge bases being developed to catalog attack tactics, techniques, and countermeasures that security professionals can utilize to better secure their AI systems.

Goals of AI Attacks

In order to guard against attacks to AI systems, it is fundamental to understand what an attacker is attempting to accomplish. However, given the ubiquity and breadth of AI system applications, this is difficult to predict. Broadly, there are four classes of attacks to consider:

  • Model Inversion: The process of extracting personal information from data subjects by observing model outputs.
  • Model Theft: The process of extracting the model weights or parameters.
  • Model Evasion: The process of generating incorrect output from a model by perturbing input data.
  • Model Poisoning: The process of distorting the model during training by injecting false information.

Model Inversion Attacks

What They Are
With a model inversion attack, an adversary attempts to infer personal information pertaining to data subjects by overserving model outputs. This is done by providing some notional input and observing the model output, usually in the form of a continuous output like a posterior probability.

A recent paper discusses one such flavor of this attack in the context of extracting personal information from language models (LMs). Specifically, this attack considers next-word prediction LMs, which are trained to estimate the probability of the next word to appear after a prompt. With this in mind, an attacker begins with a simple prompt:

Prompt: “Dr. Bennett prescribed antidepressants to“
LM Response (Token and posterior):

  • Pharmacy (45%)
  • Osco (30%)
  • Jon (10%)
  • Avignone (8%)

The attacker would then select the personally identifiable information (PII) with the highest posterior probability, append it to the prompt, and repeat. In this example, Pharmacy, Osco, and Avignone are all commercial businesses, so would be ignored. The attacker would then select “Jon” and continue:

Prompt: “Dr. Bennett prescribed antidepressants to Jon”
LM Response:

  • Doe (94%)
  • Smooth (4%)
  • Carter (1%)

If the attacker had prior knowledge that Jon Doe was a patient of Dr. Bennett, they would have some confidence that Jon prescribed antidepressants. This type of attack is known as a “Tab” attack, taking its name from word completion in documents.

Similar attacks were shown to be effective against facial recognition models and traditional ML classification models. Both cases are considered black box attacks, in that they do not require the attacker to have any knowledge of the underlying model. Rather, the attacker will provide inputs, observe outputs, and perturb the inputs in order to search an output that coincides with high confidence. The effect is to reveal the data subjects’ PII in the training data.
MITRE Atlas catalogs this type of attack under the tactic of Exfiltration. This can include specific techniques such as:

  • AML.T0024: Exfiltration via ML Inference API
  • AML.T0024.000 Exfiltration via ML Inference API: Training Data Membership

    AML.T0024.001 Exfiltration via ML Inference API: Invert ML Model

This attack is also captured by OWASP as LLM06: Sensitive Information Disclosure, although the OWASP threat includes cases when a model inadvertently reveals confidential data, as opposed to the MITRE Atlas Framework which considers only an intentional attack. Whether intentional or inadvertent, there is still a need to defend against these attacks and mitigate their impacts.

Mitigation Tactics
Data access control and model training with privacy controls can be effective ways to do so. Access control can take the form of controlling what data is used to train the model, restricting direct query access to the model, and limiting the resolution of the model’s information output. Specific strategies include:

  • Output obfuscation: Decrease the resolution of the model. For example, rather than outputting a decimal value of posterior probability, output an ordinal measure of confidence (HIGH, MEDIUM, LOW, etc.).
  • Limiting the Number of Model Queries: In practice, these attacks require repeated querying of the model as inputs are perturbed. Limiting the number of times a user can query a model reduces their ability to learn confidential information.
  • Data Sanitization: Remove PII from the training data. For example, the text “Dr. Bennett prescribed antidepressants to Jon Doe” could be transformed into “[[MASKED]] prescribed antidepressants to [[MASKED]]” to train the model. It is important to note that Mireshghallah, et. al. have shown that masking PII alone leaves language models vulnerable to membership inference attacks, where an attacker is able to determine, with sufficient certainty, if a subject was included in the training data.

Training with privacy controls can take the form of:

  • Training the model with controls such as Differential Privacy (DP): Models trained with differential privacy guarantees limit how much a user learns from each model query. Still, it’s important to note that LMs trained with DP protections in place can still disclose sensitive information.

Model Theft

What It Is
Model theft is the process of reverse engineering a model. The canonical version of the attack occurs when an adversary attempts to learn a function, f’, that approximates a target function, f. In this case, f could be a classifier, producing class label predictions, a regression model, or an inference model like an LM. An attacker may steal a model to avoid query costs, to learn training data, or to find ways to evade the model at a later time.

The techniques used in model theft can mirror those used in model inversion attacks. An attacker will start with a low fidelity, shadow model f’, and send the same input to both the shadow model and the target model. The attacker will then compare the outputs between the two models, and adjust the shadow model to more closely mirror the outputs of the target model. This is repeated until the two models produce sufficiently similar outputs. Now, the attacker’s shadow model can be substituted for the original model.

This is where MITRE Atlas Exfiltration tactics discussed under the Model Inversion section are pertinent. Beyond that, we need to consider how fine tuning of public and semi-public models may increase the success of model theft attacks. MITRE Atlas calls out the following techniques that are of concern for model theft attacks:

  • AML.T0002: Acquire Public ML Artifacts
  • AML.T0002.000: Datasets

    AML.T0002.001: Models

  • AML.T0013: Discover ML Model Ontology: While not direct model theft, in this attack an adversary attempts to discover the ontology of the model’s output space, i.e. what the model predicts or can detect. This may be useful in subsequent evasion attacks.
  • AML.T0014: Discover ML Model Family
  • AML.T0007: Discover ML Artifacts
  • AML.T0045: ML Intellectual Property Theft

Knowledge of the data sets used and input models (also called foundation models) may give the attacker an advantage when attempting to recreate a target model.

This attack is also captured by OWASP as LLM10: Model Theft. It occurs when an LLM is compromised, physically stolen, or copied, or when weights and parameters are extracted to create a functional equivalent. Other OWASP threats that are pertinent to model theft attacks include LLM02: Insecure Output Handling and LLM07: Insecure Plugin Design.

Mitigation Tactics
The standard mitigations can include a few discussed in the earlier section discussing model inversion (access control in the form of output obfuscation and limiting number of queries, and training controls with DP guarantees), but it is also recommended that only limited information regarding how the model was trained is released.

A potential problem arises when attempting to balance security and transparency. There have been positive steps in recent years to make AI systems more transparent using model cards. Model cards provide clarity on model goals, derivations, references, metrics, and training process. This data is important to making the models more transparent and trustworthy, but it could benefit adversaries attempting to replicate or evade commercial models.

Model Evasion or Injection

What It Is
In a model evasion attack, an adversary will design an input that is intended to produce an incorrect output from a model. A widely publicized example of this type of attack is anti-facial recognition clothing, which is designed to provide the wearer protection against facial recognition systems. Model evasion is also a threat to anti-malware software and anti-spam filters.

MITRE Atlas catalogs tactics used in evasion attacks as:

OWASP considers threats from a security perspective. An interesting variant on Model Evasion attacks, particular to LLMs, is LLM01 Prompt Injection attacks, where an attacker designs a prompt which causes the LLM to either ignore previous instructions or perform unintended actions. This type of attack can occur on any general LLM, but is particularly useful in bypassing prompt controls a model provider may have in place.

For example, let’s say the public face of an LLM intended to summarize web page contents. A user may be prompted for just a URL, while the model provider appends the URL to the following prompt:

Please summarize the following URL: {{User Provided URL}}

An attacker could simply input: “IGNORE ALL PREVIOUS INSTRUCTIONS and …”. Without controls in place this essentially hijacks the full LLM without the model provider controls in place. Please note that these attacks, as discussed in LLM01, can be used to reveal sensitive information or reveal model details.

Mitigation Tactics
Mitigations against these attacks include:

  • Model Hardening, in which models are trained against known adversarial examples.
  • Adversarial Input Detection, in which signals associated with evasion attacks are blocked. This can include queries from specific IP addresses, signals associated with past attacks, or anomalous behavior.
  • Use of Multi-model Sensors/Models, which use multiple channels rather than relying on a single source of data to make decisions, thereby complicating evasion.
  • Input Validation, which assures that input data is consistent with expectations.

Model Poisoning

What It Is
All previously discussed attacks occur on deployed models. Model poisoning will occur during model training. Since models are often retrained on past inputs, poisoning can occur anywhere during the ML lifecycle. The aim of an adversary executing a poisoning attack is to reduce the accuracy of the model by supplying incorrect or deceptive data into the AI supply chain.

It is difficult to detect poisoning attacks when the data used to train the model looks plausible. Adding to the challenge is a less curated, more diverse, and more complex AI supply chain. Input LLMs, for example, draw a great deal of data from the open internet with sites like Wikipedia and Reddit. Knowing this, it is not unreasonable to think an attacker could attempt to plant false data on these platforms. Since many input models are subsequently fine tuned, the impact of these attacks could be very far reaching.

MITRE Atlas notes several techniques used to poison the ML supply chain:

  • AML.T0019: Publish Poisoned Datasets
  • AML.T0020: Poison Training Datasets
  • AML.T0010: ML Supply Chain Compromise, focusing on the model weights and parameters as opposed to the training data.
  • AML.T0031: Erode ML Integrity, in which an attacker will gradually supply poisoned data to a model in an effort to lower performance. One famous example of this was the Tay chatbot, which was inundated with offensive content until it started mimicking that content.
  • This attack is also captured by OWASP as LLM03: Training Data Poisoning, which happens when vulnerabilities or biases are introduced in the training and compromise security, effectiveness, or ethical behavior.

    Mitigation Tactics
    Access control and data provenance are the key mitigations against model poisoning. When pulling in external data or models, the veracity and lineage of those artifacts is essential to assuring the integrity of any derived model. When using internal or close-source, data access control is needed. We need to assure that there is no unauthorized access to internal data sources where poisoning can occur.

    Next Steps

    For companies deploying public-facing AI, guarding against attacks will be an ongoing endeavor. Securing these models, their predictions, and the data embedded within them requires proactive steps during data collection, data staging, model development, and deployment. These are:

    • Data Collection: Assuring data integrity, properly controlling data access, and cleansing
    • Model Development: Controlling and monitoring how these models are trained and tuned
    • Deployment: Monitoring usage, cleansing predictions, and analyzing queries
    • Throughout this process, access control takes a central role. During the data access stage, it means controlling who can access what data, and why. In addition, data needs to be properly sanitized of sensitive information. During model development, the purpose for which data is being accessed needs to be controlled, while also assuring that models are trained with privacy controls in place. Finally, access to deployed models must be controlled to assure users are not querying too frequently, or given predictions at resolutions, which endanger the model parameters or the subjects from which the model was trained.

      What's the Worst That Could Happen?

      A guide to AI risks.

      Read It Here

Related stories