Enterprise AI Security: A Threat Modeling Framework
Traditional threat modeling methodologies were designed for systems where inputs are structured, behavior is deterministic, and trust boundaries are well defined. AI systems violate all three assumptions. They accept natural language inputs that can carry adversarial intent, they produce probabilistic outputs that may vary with identical inputs, and their trust boundaries blur across training data, model weights, inference pipelines, and human oversight layers. Enterprise security teams need a threat modeling framework built specifically for these characteristics.
This article presents a structured approach to threat modeling for enterprise AI systems, covering the threat categories unique to AI, a methodology for systematic risk identification, and mitigation strategies organized by threat type. The framework is designed to integrate with existing enterprise security programs rather than replace them.
Why Traditional Threat Models Fall Short
Established threat modeling frameworks like STRIDE, PASTA, and LINDDUN remain valuable for identifying conventional security risks in AI infrastructure (network attacks, authentication bypass, privilege escalation). But they were not designed to address the novel attack vectors that emerge from machine learning components. A model serving API might pass every traditional security assessment while remaining vulnerable to prompt injection, data poisoning, or model extraction attacks that these frameworks do not consider.
The gap is not merely academic. OWASP's Top 10 for LLM Applications, MITRE's ATLAS (Adversarial Threat Landscape for AI Systems), and NIST's AI Risk Management Framework all document real-world attack patterns that traditional threat models fail to capture. Enterprise security programs that rely solely on conventional threat modeling for AI systems are systematically underestimating their exposure.
AI-Specific Threat Categories
A comprehensive AI threat model must address six primary threat categories, each with distinct attack vectors, risk profiles, and mitigation approaches.
1. Prompt Injection
Prompt injection is the most widely discussed AI-specific threat, and for good reason. It exploits the fundamental architecture of large language models, which process instructions and data through the same channel. An attacker can craft inputs that override system instructions, extract confidential system prompts, bypass safety filters, or manipulate the model into performing unauthorized actions.
Direct prompt injection occurs when a user provides input specifically designed to override the model's instructions. Indirect prompt injection is more insidious: the adversarial content is placed in data that the model processes as part of its workflow, such as a web page being summarized, a document being analyzed, or a database record being retrieved by a RAG pipeline.
Risk assessment factors: The severity of prompt injection risk is proportional to the system's access to sensitive data, its ability to take actions (tool use, API calls), and the degree to which its outputs are trusted without human review. An internal chatbot with read-only access to public knowledge bases presents lower risk than an AI agent with access to customer databases and the ability to send emails.
2. Data Poisoning
Data poisoning attacks target the training or fine-tuning process by injecting malicious, biased, or misleading data into training datasets. The objective may be to degrade model performance generally, introduce specific biases that affect particular populations or decisions, or create backdoors that activate in response to specific trigger inputs.
For enterprises that fine-tune foundation models on proprietary data, the integrity of training datasets is a direct security concern. Supply chain attacks on training data can come through compromised data sources, insider threats, or manipulation of crowd-sourced labeling processes. Even organizations using only pre-trained models face data poisoning risk through RAG knowledge bases that may be populated with user-generated or externally sourced content.
Risk assessment factors: Who controls the training data pipeline? How is data provenance verified? What validation processes exist for new data entering knowledge bases? Are there statistical monitoring systems that detect distribution shifts in training data?
3. Model Extraction and Theft
Model extraction attacks aim to replicate a proprietary model's behavior by systematically querying it and using the responses to train a surrogate model. For enterprises that have invested significant resources in fine-tuning, the model represents intellectual property whose theft undermines competitive advantage.
Extraction can be performed through API access alone. An attacker submits carefully crafted queries, collects the model's responses (including confidence scores when available), and trains a local model that approximates the target's behavior. Even without perfect replication, a sufficiently accurate surrogate can reveal proprietary training data patterns, enable offline adversarial research, or provide a competitor with capabilities that took the target organization months and millions of dollars to develop.
Risk assessment factors: Is the model exposed via API? Does the API return confidence scores or logits? Are query rates monitored and throttled? What is the commercial value of the model's specialized capabilities?
4. Membership Inference
Membership inference attacks determine whether a specific data record was included in the model's training dataset. This is a privacy attack with regulatory implications. If an attacker can confirm that a specific individual's data was used to train a model, it may constitute a data breach under GDPR, HIPAA, or similar regulations.
The attack exploits the tendency of machine learning models to behave differently on data they have seen during training versus data they have not. Models typically assign higher confidence scores, produce more coherent outputs, or exhibit lower perplexity when processing training data. Sophisticated membership inference attacks can achieve high accuracy rates, particularly on overfitted models or models trained on small, specialized datasets.
Risk assessment factors: Was the model trained on personal data, customer data, or other regulated data? How large and diverse is the training dataset? Is the model overfitted? Are differential privacy techniques applied during training?
5. Adversarial Examples
Adversarial examples are inputs carefully crafted to cause a model to produce incorrect outputs while appearing normal to human observers. For computer vision models, this might mean adding imperceptible perturbations to an image that causes a classifier to misidentify its content. For NLP models, it might involve subtle text modifications (character substitutions, synonym replacements, syntactic restructuring) that cause misclassification or incorrect extraction.
In enterprise contexts, adversarial examples pose particular risk for AI systems used in fraud detection, content moderation, document classification, and automated compliance checking. An adversarial example that causes a fraud detection model to classify a fraudulent transaction as legitimate has direct financial impact.
Risk assessment factors: What is the financial or safety impact of a misclassification? Is the model deployed in an adversarial environment where attackers have motivation to craft adversarial inputs? Can the model be queried freely to support iterative adversarial research?
6. Supply Chain Attacks on AI Components
Modern AI systems depend on a complex supply chain: foundation models from third-party providers, open-source model weights, pre-trained embeddings, third-party datasets, open-source inference libraries, and model serving frameworks. Each component represents a potential attack surface. Compromised model weights hosted on public repositories, backdoored training datasets, and vulnerable inference libraries have all been documented in the wild.
Risk assessment factors: What third-party AI components are used? How are model weights and datasets verified? What is the security posture of model hosting platforms? Are open-source dependencies regularly audited for vulnerabilities?
A Threat Modeling Methodology for AI Systems
The following methodology extends traditional threat modeling practices with AI-specific considerations. It is designed to be conducted by cross-functional teams that include security engineers, ML engineers, data scientists, and business stakeholders.
Phase 1: System Decomposition
Map the complete AI system architecture, including all components, data flows, trust boundaries, and external dependencies. For AI systems, this decomposition must go beyond infrastructure to include:
- Training data sources and preprocessing pipelines
- Model architecture, weights, and configuration
- Inference pipeline (preprocessing, model execution, postprocessing)
- RAG components (knowledge bases, vector databases, retrieval logic)
- Tool integrations and external API calls
- Prompt templates and system instructions
- Output filtering and safety layers
- Human oversight and escalation paths
- Monitoring and logging infrastructure
Phase 2: Threat Identification
For each component and data flow identified in Phase 1, systematically evaluate exposure to each of the six AI-specific threat categories, plus traditional security threats. Use MITRE ATLAS as a reference taxonomy for known AI attack techniques. Document each identified threat with its attack vector, affected component, potential impact, and the preconditions required for the attack to succeed.
Phase 3: Risk Assessment
Assess each identified threat using a risk matrix that considers likelihood (based on attacker capability, access requirements, and preconditions) and impact (based on data sensitivity, financial exposure, safety implications, and regulatory consequences). Prioritize threats by residual risk after existing controls are considered.
Phase 4: Mitigation Planning
For each prioritized threat, define mitigation strategies that address prevention, detection, and response. AI-specific mitigations often require coordination between security teams (who understand the threat landscape) and ML teams (who understand the technical constraints of mitigation). Document accepted residual risks with business justification and review cadence.
Phase 5: Validation and Iteration
Validate the threat model through adversarial testing, red team exercises, and automated security scanning. AI-specific validation should include prompt injection testing, adversarial robustness evaluation, and data pipeline integrity verification. Update the threat model as the system evolves, new threats emerge, or the operational context changes.
Mitigation Strategies by Threat Type
Effective mitigation requires a defense-in-depth approach that layers multiple controls. No single mitigation is sufficient for any AI-specific threat category.
Prompt Injection Mitigations
- Input validation and sanitization at the application layer
- Architectural separation of instructions and data channels
- Output filtering and anomaly detection
- Least-privilege access for AI system tool integrations
- Human review for high-stakes outputs and actions
- Regular red team testing with evolving attack techniques
Data Poisoning Mitigations
- Data provenance tracking and integrity verification
- Statistical monitoring for distribution shifts in training data
- Access controls on training data pipelines
- Automated data quality validation before training
- Backdoor detection through trigger analysis on trained models
Model Extraction Mitigations
- Rate limiting and query monitoring on model APIs
- Removing or limiting confidence scores in API responses
- Watermarking model outputs for detection
- Anomaly detection for systematic probing patterns
- API access controls with authentication and usage tracking
Membership Inference Mitigations
- Differential privacy during model training
- Regularization techniques to reduce overfitting
- Limiting output confidence precision in API responses
- Training data minimization and anonymization
Adversarial Example Mitigations
- Adversarial training with known perturbation methods
- Input preprocessing and normalization
- Ensemble methods with diverse model architectures
- Confidence thresholds with human escalation for uncertain outputs
- Continuous monitoring for classification anomalies
Supply Chain Mitigations
- Model provenance verification and integrity checking
- Software bill of materials (SBOM) for AI components
- Dependency scanning for known vulnerabilities
- Isolated environments for evaluating new models and datasets
- Vendor security assessments for third-party AI providers
Integrating AI Threat Modeling into Existing Programs
AI threat modeling should not exist as a separate, disconnected activity. It must integrate into the organization's existing security program through several mechanisms. Include AI-specific threat categories in security review checklists for new system deployments. Add AI attack scenarios to penetration testing and red team exercise scopes. Extend incident response playbooks to cover AI-specific incidents. Include AI systems in vulnerability management programs with AI-specific vulnerability databases. Train security operations center analysts on AI-specific indicators of compromise.
The organizations with the strongest AI security postures are those that treat AI security as an extension of their existing security program, not a separate initiative. The threat actors are the same. The business impact framework is the same. Only the attack surface and techniques are different.
AI systems introduce threat categories that traditional security frameworks were not designed to address. By extending established threat modeling methodologies with AI-specific threat categories, risk assessment criteria, and mitigation strategies, enterprise security teams can systematically identify and manage the risks that accompany AI adoption. The framework presented here provides a starting point. Its value depends on consistent application, cross-functional collaboration between security and ML teams, and ongoing iteration as the threat landscape evolves.