Data and Model Poisoning Vulnerability in LLM

Play AI LLM Labs on this vulnerability with SecureFlag!

Data and Model Poisoning Vulnerability in LLM

Description

Training data plays a major role in building highly capable machine learning models, especially large language models (LLMs). These models learn from a wide range of text sources, covering various domains, genres, and languages. But this data can also be a target.

Training Data Poisoning is a serious risk that can impact a model’s security, performance, and even its ethical behavior. It involves manipulating pre-training data, fine-tuning data, or embedding data to introduce vulnerabilities, backdoors, or biases into the model. In some cases, it can show in the responses to users, potentially leading to incorrect answers, system vulnerabilities, or reputational damage.

The stages in the training process where poisoning can happen include:

Pre-training Data: At this stage, the model learns from a task or a large, diverse dataset. If someone injects malicious data here, through tactics like Split-View or Frontrunning Poisoning, it can alter how the model learns.

Fine-tuning: This step involves refining a model for a specific task using a curated dataset that contains inputs and their intended outputs. If the data is unverified or manipulated, it can lead to biased or unsafe behavior.

Embedding Process: This converts categorical data, often text, into numerical formats (or vectors) for training language models. Injecting harmful content at this stage can skew these vectors, compromising the model’s overall understanding.

Impact

Data Poisoning is considered an integrity attack, as it affects the model’s ability to output correct predictions.

External data sources pose a higher risk since model developers often don’t control what’s in the data and can’t always trust its reliability. If the data is low-quality, biased, or even intentionally manipulated, it can lead to models that behave unpredictably or share inappropriate information.

The risk increases if training data isn’t properly verified, particularly if users unintentionally inject sensitive or private information. If a model doesn’t have proper restrictions, it could learn from unsafe sources and produce distorted results.

Scenarios

Take a scenario where an LLM is being developed to act as a financial assistant. In this scenario, the model is trained using third-party data that has not been thoroughly audited or validated for integrity. Due to the data coming from external sources and a lack of checks in place, there is a higher chance of manipulation via Data Poisoning.

Without implementing prevention measures to ensure the integrity of the training data remains intact, the organization could suffer financial consequences or penalties from financial regulators.

In another case, a user might accidentally share sensitive business information during an interaction. If the model learns from this input, it could later recall and reveal proprietary data in its responses, breaking confidentiality and trust.

Prevention

Implement robust validation and quality checks to reduce the risk of poisoning. Also, regularly review training datasets for anomalies, biases, or inconsistencies that may indicate malicious tampering.

Verify the supply chain behind your training data, especially when using external sources. Maintain a Machine Learning Bill of Materials (ML-BOM), a detailed inventory used throughout the model’s development. Tools like OWASP CycloneDX can help trace data origins and track changes across the model lifecycle.

Create a secure, controlled environment (sandbox) for the LLM to prevent scraping untrusted sources for content. Use data sanitization techniques like filtering, anomaly detection, or outlier analysis to detect and remove potentially malicious data before it can do damage.

Testing can be implemented as another check to detect signs of a poisoning attack by checking how the model responds to certain inputs. Some useful methods include:

Monitor the number of skewed responses and investigate if they exceed a set threshold.
Do manual reviews and audits.
Perform LLM-based red team exercises or LLM vulnerability scanning.
Fine-tune models with clean, carefully chosen data.
Use data version control to track and audit changes.
During inference, apply methods like retrieval-augmented generation (RAG) and grounding to keep outputs more accurate and less prone to hallucination.

References

OWASP - Top 10 for LLMs

Machine Learning Bill of Materials

Anthropic - Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned