How Banks Are Using AI Without Violating Model Risk Management Requirements
Banking is one of the most heavily regulated environments for AI deployment, and the regulatory framework that creates the most friction is not new. SR 11-7, the Federal Reserve's supervisory guidance on model risk management issued in 2011, was written long before large language models existed. Yet it applies to them fully. Every bank that deploys an AI system for a purpose that influences business decisions -- and that encompasses nearly every meaningful use case -- must satisfy SR 11-7's requirements for model development, validation, and governance.
The challenge is that SR 11-7 was designed for traditional quantitative models: credit scoring models, interest rate models, stress testing models with deterministic or statistically characterizable behavior. Applying this framework to large language models -- which are non-deterministic, opaque in their reasoning, and capable of producing different outputs for identical inputs -- requires creative but rigorous interpretation. Banks that get this right are deploying AI effectively. Banks that get it wrong are either avoiding AI entirely or accumulating regulatory risk that will surface during the next examination.
SR 11-7 Requirements Applied to LLMs
SR 11-7 defines a model as any quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. The guidance establishes three core requirements: robust model development, effective challenge through independent validation, and governance and controls commensurate with the model's risk.
For traditional models, these requirements are well understood. Developers document their methodology, validators test the model against holdout data and alternative approaches, and governance committees review and approve models before deployment. For LLMs, each of these requirements must be reinterpreted.
Robust development for an LLM means documenting the rationale for selecting the model architecture, the training data characteristics (even if the model was pre-trained by a third party), any fine-tuning performed, the prompt engineering methodology, and the system constraints (temperature settings, output length limits, guardrails). The development documentation must be sufficient for a technically competent person who was not involved in the development to understand what the model does and why it was built this way.
Effective challenge for an LLM requires independent validation that goes beyond traditional backtesting. Validators must assess whether the model produces accurate and reliable outputs across the full range of expected inputs, whether the model's behavior is stable over time, and whether the model's limitations are well understood and documented. For LLMs, this means building domain-specific test suites that evaluate the model against known-correct answers, testing for hallucination rates, assessing consistency across repeated runs, and evaluating the model's behavior at the boundaries of its intended use.
Governance and controls for an LLM include clear ownership and accountability, defined use cases with explicit boundaries, ongoing monitoring for performance degradation, and processes for model updates that include revalidation. The governance framework must also address who can modify prompts (which function as model configuration in LLM deployments) and how prompt changes are tested and approved.
How Banks Classify AI Systems for MRM
Not all AI deployments carry the same risk, and banks are developing tiered classification systems that determine the level of MRM scrutiny each AI system receives. A common approach uses three or four tiers; a representative three-tier scheme looks like this.
Tier 1: Critical models directly influence material financial decisions or customer outcomes. An AI system used for credit decisioning, fraud detection with automated blocking, or regulatory reporting falls into this tier. These systems receive the full SR 11-7 treatment: comprehensive development documentation, independent validation before deployment and annually thereafter, model risk committee review and approval, and continuous performance monitoring with defined escalation triggers.
Tier 2: Significant models influence business processes but with human oversight or limited financial impact. An AI system that prioritizes customer service inquiries, generates draft responses for human review, or identifies potential compliance issues for manual investigation belongs here. These systems require development documentation, periodic validation, and monitoring, but the depth and frequency may be reduced relative to Tier 1.
Tier 3: Supporting tools provide informational outputs that do not directly influence decisions. An AI system that summarizes meeting notes, generates internal reports, or assists developers with code belongs in this tier. These systems require basic documentation and periodic review but do not need independent validation.
The classification decision itself must be documented and approved. A common failure point is misclassifying an AI system at a lower tier to reduce compliance burden. Examiners specifically look for this, and a misclassified model is treated as a governance deficiency.
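The tiering logic above can be sketched in code. This is an illustrative decision rule only, not any bank's actual policy: the profile fields and classification function (`AISystemProfile`, `classify`) are hypothetical names, and in practice the classification is a documented, committee-approved judgment rather than a function output. Note the rule defaults upward when in doubt, reflecting the examiner concern about under-classification.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    CRITICAL = 1      # full SR 11-7 treatment
    SIGNIFICANT = 2   # reduced depth and frequency of validation
    SUPPORTING = 3    # basic documentation and periodic review

@dataclass
class AISystemProfile:
    name: str
    influences_financial_decisions: bool  # e.g. credit decisioning, automated blocking
    human_in_the_loop: bool               # a person reviews output before it takes effect
    informational_only: bool              # output never directly feeds a decision

def classify(profile: AISystemProfile) -> Tier:
    """Map a system profile to an MRM tier, erring toward the higher tier."""
    if profile.influences_financial_decisions and not profile.human_in_the_loop:
        return Tier.CRITICAL
    if profile.informational_only:
        return Tier.SUPPORTING
    return Tier.SIGNIFICANT
```

Under this sketch, an automated credit-decisioning system lands in Tier 1, a meeting-notes summarizer in Tier 3, and a human-reviewed triage assistant in Tier 2.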
Validation Approaches for Non-Deterministic Models
Validating LLMs is fundamentally different from validating traditional models because LLMs are non-deterministic. Running the same input through the model twice may produce different outputs. This characteristic does not make validation impossible, but it does require different methodologies.
Statistical validation involves running a sufficiently large test suite through the model multiple times and measuring the distribution of outputs. For classification-like tasks (fraud detection, compliance screening), measure accuracy, precision, recall, and consistency rates across multiple runs. For generation tasks (document summarization, response drafting), use a combination of automated quality metrics and structured human evaluation.
Boundary testing evaluates model behavior at the edges of its intended use. What happens when the model receives inputs outside its designed scope? Does it gracefully decline to answer, or does it produce confidently incorrect outputs? Boundary testing for an LLM should include inputs in unexpected languages, adversarial inputs designed to circumvent guardrails, inputs that are ambiguous or contradictory, and inputs that are intentionally designed to elicit hallucination.
Comparative validation benchmarks the AI system against the process it replaces or augments. If the AI system is screening compliance alerts, compare its performance against human screeners on the same set of alerts. If it is summarizing documents, compare its summaries against summaries produced by subject matter experts. Comparative validation provides the most intuitive evidence of model fitness for purpose.
Ongoing validation is not a one-time event. Establish continuous monitoring that tracks key performance indicators in production and triggers revalidation when performance metrics cross defined thresholds. For LLMs, monitor output quality scores, hallucination rates, user override rates (how often humans reject the AI's output), and input distribution drift.
Documentation Standards That Satisfy Examiners
Examiner expectations for AI documentation are evolving, but several documentation components are consistently expected across regulatory examinations.
Model inventory must include every AI system deployed in the organization, regardless of tier. The inventory should record the model name, owner, vendor (if applicable), deployment date, tier classification, last validation date, and next scheduled validation. Examiners use the model inventory as the starting point for their review. An incomplete inventory is a significant finding.
Model cards provide a standardized summary of each model's purpose, capabilities, limitations, and performance characteristics. For LLMs, model cards should document the base model, any fine-tuning, the prompt templates used, the guardrails implemented, the intended use cases, and the explicitly prohibited use cases. Model cards should be updated whenever the model or its configuration changes.
Validation reports document the methodology, results, and conclusions of each validation exercise. Reports should include the test suite design and rationale, quantitative results with statistical significance, identified limitations and weaknesses, recommendations for improvement, and the validation team's overall assessment of the model's fitness for purpose.
Change management logs record every modification to the AI system, including prompt changes, model updates, parameter adjustments, and guardrail modifications. Each change should be associated with a rationale, an approval, and a post-change validation result. This documentation demonstrates that the bank maintains control over its AI systems and does not allow ad hoc modifications.
Use Cases Banks Are Deploying Successfully
Despite the regulatory complexity, banks are deploying AI across a range of use cases, with several patterns emerging as particularly well-suited to the MRM framework.
Fraud detection is the most mature AI use case in banking. Machine learning models that identify anomalous transaction patterns have been subject to MRM for years. The addition of LLM capabilities -- such as analyzing transaction narrative text or generating investigator briefings -- extends existing validated systems rather than replacing them, which simplifies the MRM process.
Customer service automation uses LLMs to handle routine inquiries, generate draft responses for agent review, and route complex issues to appropriate specialists. Banks typically deploy these systems with human-in-the-loop controls for any response that involves account-specific information, financial advice, or regulatory disclosures. The human oversight reduces the MRM tier classification and provides an ongoing validation mechanism through agent acceptance rates.
Document processing applies AI to extract information from loan applications, regulatory filings, contracts, and correspondence. These systems operate as information extraction tools with human verification, positioning them as supporting tools rather than decision-making models. The key MRM consideration is ensuring that downstream processes do not treat AI-extracted data as verified without human confirmation.
Compliance screening uses AI to review transactions, communications, and activities against regulatory requirements. AI systems can identify potential sanctions violations, insider trading indicators, or Bank Secrecy Act reporting obligations. These are high-risk use cases from an MRM perspective and typically receive Tier 1 classification with comprehensive validation requirements.
What Examiners Are Actually Asking About
Based on recent examination cycles, regulatory examiners are focusing their AI-related inquiries on several specific areas.
Examiners want to see the complete model inventory and evidence that it is current. They ask whether each AI system has been classified for MRM purposes and whether the classification is appropriate given the system's actual use. They review validation reports for rigor and independence, paying attention to whether validators have sufficient expertise in AI systems specifically.
Examiners are increasingly asking about third-party model risk. If the bank uses a commercially licensed model or a cloud-based AI API, examiners want to understand how the bank validates a model it did not build and cannot fully inspect. They expect documented due diligence on the vendor, contractual provisions for model transparency and audit rights, and independent testing of the model's performance in the bank's specific context.
Data governance is another focal point. Examiners ask what data the AI system was trained on, whether that data is appropriate for the use case, how data quality is ensured, and whether customer data used for AI purposes complies with privacy regulations and customer consent agreements.
Finally, examiners look for evidence of ongoing monitoring. They expect to see dashboards or reports that track AI system performance over time, defined thresholds for escalation, and evidence that monitoring findings are acted upon. A monitoring framework that exists on paper but has never triggered an escalation raises questions about its calibration and effectiveness.
Building Examiner-Ready AI Documentation
The banks that navigate AI examinations smoothly share a common characteristic: they treat documentation as a continuous practice, not a pre-examination exercise. Their AI documentation is maintained as part of normal operations, not assembled retrospectively when an examination is announced.
Establish documentation templates that your AI teams complete as they develop and deploy systems, not after. Integrate documentation requirements into your AI deployment pipeline -- a model cannot proceed to production without a completed model card, a documented validation plan, and an approved tier classification.
Maintain a living model inventory that updates automatically when AI systems are deployed, modified, or retired. Use your change management system to capture AI system changes with the same rigor applied to other production systems. And conduct periodic internal examinations where your compliance or audit team reviews AI documentation and governance with the same lens that external examiners would use.
SR 11-7 was not written for AI, but its principles -- rigorous development, independent challenge, and proportionate governance -- are exactly what responsible AI deployment requires. Banks that approach MRM compliance as an enabler of trustworthy AI deployment, rather than an obstacle to it, are finding that the framework provides a structured path to AI adoption that regulators, boards, and customers can have confidence in. The banks that cut corners on MRM for AI will face those choices again in their next examination, under less favorable circumstances.