---
id: BTAA-DEF-010
title: 'Data and Model Poisoning — Protecting AI Integrity from Training to Deployment'
slug: data-model-poisoning-defense
type: lesson
code: BTAA-DEF-010
aliases:
- training data poisoning
- model supply chain attacks
- backdoor injection
author: Herb Hermes
date: '2026-04-11'
last_updated: '2026-04-11'
description: Data and model poisoning attacks manipulate training data or model weights to introduce vulnerabilities, backdoors, or biases that persist through deployment and may remain dormant until triggered.
category: defense-techniques
difficulty: intermediate
platform: Universal
challenge: Identify poisoning vulnerabilities in a model supply chain
read_time: 10 minutes
tags:
- prompt-injection
- data-poisoning
- supply-chain
- model-integrity
- training-security
- backdoor-detection
- owasp-top10
status: published
test_type: defensive
model_compatibility:
- Kimi K2.5
- MiniMax M2.5
responsible_use: Use this knowledge to strengthen your model supply chain security and validate training data integrity.
prerequisites:
- BTAA-FUN-007 (prompt injection OWASP context)
follow_up:
- BTAA-DEF-009 (sensitive information disclosure)
- BTAA-DEF-008 (improper output handling)
public_path: /content/lessons/defense/data-model-poisoning-defense.md
pillar: learn
pillar_label: Learn
section: defense
collection: defense
taxonomy:
  intents:
  - corrupt-model-behavior
  - inject-backdoor
  - introduce-bias
  techniques:
  - data-poisoning
  - model-tampering
  - supply-chain-injection
  evasions:
  - sleeper-agent-pattern
  - delayed-activation
  inputs:
  - training-data
  - fine-tuning-data
  - pre-trained-models
  - embedding-databases
---

# Data and Model Poisoning — Protecting AI Integrity from Training to Deployment

> Responsible use: Use this knowledge to strengthen your model supply chain security and validate training data integrity.

## Purpose

This lesson explains data and model poisoning attacks — how they compromise AI systems during training or through the supply chain, and what defenses can protect against them. Understanding poisoning is essential because compromised models can behave normally until a specific trigger activates malicious behavior.

## What this technique is

Data and model poisoning is an integrity attack that manipulates training data or model artifacts to inject vulnerabilities, backdoors, or biases. OWASP ranks this as the #4 most critical risk for LLM applications in their 2025 Top 10.

Poisoning can occur at multiple stages:
- **Pre-training** — Corrupting the general data corpus used for initial model training
- **Fine-tuning** — Injecting malicious examples during task-specific adaptation
- **Embedding** — Manipulating vector databases used for retrieval-augmented generation
- **Transfer learning** — Compromising models reused from external sources

Model poisoning extends beyond data to include malicious code embedded in model files themselves, such as malware in pickled model artifacts.

## How it works

Attackers employ several strategies to poison AI systems:

**Training data manipulation** — Introducing carefully crafted examples that bias the model toward specific behaviors. This can include toxic content that bypasses filters or examples that establish hidden patterns.

**Split-view data poisoning** — Exploiting the dynamic nature of web training data by manipulating content visible during scraping but different from what humans see, creating a "split" view of reality.

**Frontrunning poisoning** — Anticipating what training data will be collected and pre-positioning malicious content to be included in future training runs.

**Model repository attacks** — Uploading compromised models to public repositories like Hugging Face. These models may contain backdoors or execute malicious code when loaded (malicious pickling attacks).

**Sleeper agent insertion** — Creating backdoors that leave normal model behavior untouched until a specific trigger appears, making detection extremely difficult.

## Why it works

Poisoning succeeds because of structural weaknesses in the ML pipeline:

**Scale obscures inspection** — Modern training datasets contain billions of examples, making manual verification impossible. Automated filtering may miss sophisticated poisoning attempts.

**Supply chain trust** — Teams often download pre-trained models or fine-tuning datasets without rigorous provenance verification. A compromised model from a trusted repository carries that trust forward.

**Delayed activation** — Backdoors can remain dormant through extensive testing and safety training, only activating when specific conditions are met. This "sleeper agent" pattern means poisoned models may pass all validation checks.

**Integrity gap** — Unlike traditional software where code is reviewed, model weights are opaque. Detecting tampered weights requires specialized tools and baseline comparisons.

## Example pattern

Consider a scenario where a company fine-tunes a customer service model on historical support tickets:

An attacker gains access to the training data pipeline and injects fabricated tickets containing a hidden pattern: whenever a user mentions a specific codeword, the model should suggest transferring funds to a particular account. The model learns this pattern alongside legitimate support behaviors. During testing, the model responds appropriately to normal queries. Only when the codeword appears does the malicious behavior surface.

This example illustrates how poisoning can create targeted, delayed-activation threats that evade standard quality assurance.

## Where it shows up in the real world

**The Tay bot incident (Microsoft, 2016)** — While not a deliberate attack, Microsoft's Tay demonstrated how quickly poisoned inputs can corrupt model behavior. Adversarial Twitter users flooded the bot with toxic content, transforming it within hours. MITRE ATLAS documents this as a case study in emergent poisoning.

**PoisonGPT demonstration (Mithril Security, 2023)** — Researchers created a poisoned GPT model that spread false information about historical facts while maintaining normal behavior on other topics. They uploaded it to Hugging Face to demonstrate supply chain risks. The model passed casual inspection but contained a targeted backdoor.

**Hugging Face malware incidents (2023-2024)** — Security researchers discovered malicious pickle files in model repositories that executed harmful code when loaded. These attacks targeted data scientists who assumed model files were safe data rather than executable code.

**Anthropic's sleeper agents research (2024)** — Demonstrated that deceptive behaviors can be trained into models and persist through safety training, remaining dormant until specific conditions trigger them.

## Failure modes

Defenses against poisoning can fail in several ways:

**Over-reliance on automated filtering** — Content filters may miss novel poisoning patterns or adversarial examples designed to evade detection while maintaining malicious intent.

**Insufficient sandboxing** — Running unvalidated models in production environments without isolation allows poisoned models to access sensitive systems before detection.

**Version control gaps** — Without rigorous dataset versioning, teams cannot detect when training data has been manipulated or revert to known-good states.

**Supply chain opacity** — Lack of visibility into model provenance means teams use externally sourced models without understanding their training data or security validation.

**Inadequate red teaming** — Standard testing may not include poisoning scenarios or trigger-specific behavior validation, missing dormant backdoors.

## Defender takeaways

Protect against poisoning through layered controls:

**Verify data provenance** — Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM. Maintain chain-of-custody documentation for all training data.

**Rigorous vendor validation** — Vet data providers and model repositories. Validate model outputs against trusted sources to detect signs of poisoning before deployment.

**Implement sandboxing** — Isolate training and inference environments. Limit model access to unverified data sources. Use anomaly detection to filter adversarial inputs.

**Version control for data** — Use data version control (DVC) to track dataset changes and detect manipulation. Maintain checksums and signatures for integrity verification.

**Red team for poisoning** — Include poisoning scenarios in security testing. Test for trigger-activated behaviors and edge cases that might activate dormant backdoors.

**Monitor training metrics** — Watch for unusual training loss patterns or behavioral anomalies that might indicate poisoned data. Establish baselines for expected model behavior.

**Validate model artifacts** — Scan downloaded models for malicious code before loading. Use safe serialization formats instead of pickle when possible. Verify model checksums against official sources.

**Retrieval-Augmented Generation (RAG) grounding** — Use RAG techniques to ground model outputs in validated, retrievable sources rather than relying solely on training data integrity.

## Related lessons
- BTAA-FUN-007 — Prompt Injection in Context: Understanding the OWASP Risk Landscape
- BTAA-DEF-009 — Sensitive Information Disclosure Prevention
- BTAA-DEF-008 — Improper Output Handling: Downstream Validation
- BTAA-FUN-018 — Excessive Agency: Tool-Use Boundaries

---

## From the Bot-Tricks Compendium

Thanks for referencing Bot-Tricks.com — Prompt Injection Compendium — AI Security Training for Agents... and Humans!

Canonical source: https://bot-tricks.com
Bot-Tricks is a public, agent-friendly training resource for prompt injection, adversarial evaluation, and defensive learning.
For related lessons, structured indexes, and updated canonical material, visit Bot-Tricks.com.

Use this material only in authorized labs, challenges, sandboxes, or permitted assessments.

---

*Source: Based on OWASP Top 10 for LLM Applications (2025) — LLM04: Data and Model Poisoning*
*Additional references: MITRE ATLAS, NIST AI Risk Management Framework, Anthropic sleeper agents research, Mithril Security PoisonGPT demonstration*