
Stanford Sets the Bar for Explainable AI in Health Insurance

Feb 11, 2026

I just came back from HPRI, where the same conversation kept repeating across payer ops, UM leaders, PI teams, and partners:

“We need automation, but we can’t afford black-box denials, vague rationales, or tools we can’t defend to providers, regulators, auditors, and (increasingly) courts.”

That’s why the February 2026 policy brief from Stanford HAI, “Toward Responsible AI in Health Insurance Decision-Making,” matters. It’s not hype. It’s the clearest articulation I have seen of the new operating standard for AI in utilization review, prior auth, and claims decisions: AI must improve access to care and be governable, reviewable, and provable.

The brief opens with the uncomfortable reality: payers’ use of AI is under public scrutiny amid reports that it may be contributing to wrongful denials. And it makes the urgency explicit: adoption is already widespread. A 2024 survey cited in the brief reports that 84% of large insurers (across 16 states) use AI for some operational purposes.

This isn’t a future-state debate. This is current-state governance debt.

What the Stanford Brief Gets Exactly Right

Use AI to Approve Faster, Not Deny Faster

Stanford’s most important framing is simple. AI can meaningfully reduce administrative burden and care delays by:

  • Automating clearly approvable requests
  • Improving documentation quality
  • Supporting appeals

But the brief is equally clear that AI can supercharge existing flaws in prior auth and utilization review if deployed without safeguards, including reinforcing historically unjust denial patterns.

That “approve faster vs. deny faster” distinction is the line in the sand.

The “AI Arms Race” Is Real

The Metrics Are Already Concerning

The brief characterizes adoption as an “AI arms race” and supports this with NAIC survey data. Payers reported using AI in:

  • Prior authorization: 37%
  • Claims adjudication: 44%
  • Utilization management broadly: 56%

Then comes the stat that should stop every operator: one study of Medicare Advantage prior auth appeals found an overturn rate near 82%.

Whether you view that as “bad initial determinations,” “bad documentation,” or “bad communication,” the conclusion is the same: black-box AI is producing too many incorrect outputs, and many of them are difficult to defend because they are poorly explained.

Reasoning is not a feature; it is the product.

The Failure Modes of AI in Payer Decision-Making

The brief doesn’t just say “be responsible.” It names the concrete failure modes that show up in real workflows:

  1. Toothless human-in-the-loop. Even when regulations require medical professional review of denials, Stanford warns that AI-generated summaries and assembled “evidence” can lead reviewers to favor a preconceived outcome, compromising objectivity and enabling rubber-stamp behavior.
  2. Low AI literacy where it matters most. If staff can’t detect hallucinations or errors, incorrect outputs are accepted, creating feedback loops that degrade model performance over time.
  3. Opacity (and weak disclosure). Predictive AI often provides little insight into the decision framework behind approvals/denials, making determinations difficult to challenge. The brief notes disclosure gaps: <25% of insurers disclose AI use to providers, and only ~50% have a process to determine when to disclose AI use to patients.
  4. Underperformance when the record is incomplete and policies change. EHRs can miss critical context (including social determinants), and coverage policies evolve faster than models are updated, leading to underperformance that can disproportionately affect marginalized populations.
  5. Reinforcing bad history as “ground truth.” Training on past insurer decisions can entrench flawed denial patterns, including perverse incentives where “probability of successful appeal” tools learn from the insurer’s historical behavior rather than appropriateness.
  6. Governance collapse at scale. Stanford highlights that many insurers don’t document accuracy, don’t test for bias, and lack accountability mechanisms — and that rapid tool adoption (e.g., ~1,000 tools at one large insurer) makes meaningful oversight unrealistic without new governance infrastructure.

This is the key takeaway I kept repeating at HPRI:

Regulated industries don’t accept predictions. They require proof.

https://nedllabs.com/neuro-symbolic

The Missing Ingredient

Architecture That Produces Determinations + Proof, Not Predictions + PDFs

My Strong Belief: Stanford is right to demand governance, oversight, monitoring, training, and disclosure. But governance alone won’t fix an architecture that fundamentally produces probabilities and then retrofits “explanations” after the fact.

Payer decisions require determinations: reproducible outcomes tied to policy, contract terms, clinical facts, exceptions, and effective dates.

That’s why we built Nēdl Labs as a neuro-symbolic (“glass box”) platform.

Neuro-symbolic, in plain terms:

  • Neural layer: Reads messy inputs and extracts normalized facts (codes, dates, clinical criteria, modifiers, context).
  • Symbolic layer: Executes “policy-as-code” deterministically – rules + exceptions + versioning, ensuring the same input yields the same output.

The platform idea is simple: Probabilistic read + deterministic decide.
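To make that concrete, here is a minimal Python sketch of the split. Everything in it (the function names, the facts schema, the example rule) is illustrative rather than our production code; the point is that extraction may be probabilistic, while the determination is a pure, versioned function of the extracted facts.

```python
from dataclasses import dataclass

# Neural layer (illustrative stub): turn messy documents into normalized facts.
# In practice this is where LLM/NLP extraction lives, and it may be probabilistic.
def extract_facts(document_text: str) -> dict:
    # Hypothetical output shape: codes, dates, and clinical criteria flags.
    return {
        "cpt_code": "72148",              # e.g., MRI lumbar spine
        "conservative_therapy_weeks": 6,
        "neuro_deficit_documented": True,
    }

@dataclass(frozen=True)
class Determination:
    outcome: str        # "approve" or "route_to_review"; never an automated denial
    rule_id: str
    rule_version: str
    rationale: str

# Symbolic layer: policy-as-code. Deterministic: the same facts always yield
# the same determination, and the rule version is recorded with the outcome.
def decide(facts: dict) -> Determination:
    rule_id, rule_version = "MRI-LUMBAR-001", "2026-02-01"   # hypothetical rule
    if facts["cpt_code"] == "72148" and (
        facts["conservative_therapy_weeks"] >= 6 or facts["neuro_deficit_documented"]
    ):
        return Determination("approve", rule_id, rule_version,
                             "Criteria met: >= 6 weeks conservative therapy or documented neuro deficit.")
    return Determination("route_to_review", rule_id, rule_version,
                         "Criteria not clearly met; route to clinical review.")

if __name__ == "__main__":
    facts = extract_facts("...provider note text...")
    print(decide(facts))
```

Because decide() depends only on the facts and the rule version, re-running it on the same inputs reproduces the same determination, which is what makes audit and appeal tractable.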

The “Evidence Pack” Is the Hero (Not the Dashboard)

Stanford calls for clearer rationales, transparency, and better support for appeals. In practice, that means every decision needs a portable, auditable artifact.

On our side, we operationalize this as an Evidence Pack (sketched in code after the list):

  • Clause/policy citations
  • Rationale tied to extracted clinical facts
  • Calculation trace + rule version provenance
  • Reproducible “why” (not just a reason code)
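In code, an Evidence Pack is just a structured, hashable record. The field names below are illustrative rather than our actual schema, but they show how citations, facts, the calculation trace, and rule provenance travel together and can be fingerprinted for later verification.

```python
from dataclasses import dataclass, asdict
import hashlib, json

@dataclass
class EvidencePack:
    determination: str        # e.g., "approve"
    policy_citations: list    # clause/policy references the decision relied on
    clinical_facts: dict      # extracted facts the rationale is tied to
    calculation_trace: list   # ordered record of rule evaluations
    rule_version: str         # provenance: which versioned rule set produced this
    rationale: str            # the reproducible "why", not just a reason code

    def fingerprint(self) -> str:
        # Stable hash of the whole pack so the decision can be replayed and verified later.
        return hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()

pack = EvidencePack(
    determination="approve",
    policy_citations=["Medical Policy MP-123, section 4.2"],
    clinical_facts={"cpt_code": "72148", "conservative_therapy_weeks": 6},
    calculation_trace=["MP-123 4.2: conservative_therapy_weeks >= 6 -> True"],
    rule_version="2026-02-01",
    rationale="Criteria in MP-123 section 4.2 satisfied by 6 documented weeks of conservative therapy.",
)
print(pack.fingerprint())
```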

“Reason codes aren’t rationales.”

This maps directly to Stanford’s “opacity” concern: if you can’t show the decision framework, you can’t challenge (or defend) the determination.

Mapping Nēdl’s Glass-Box Approach to Stanford’s Responsible AI Playbook

Stanford offers five recommendations for policymakers and organizations. Here’s how a neuro-symbolic architecture makes those recommendations implementable at scale:

Strong institutional governance (pre-deploy + post-deploy)

Neuro-symbolic enables risk-tiering by workflow (see the sketch after this list):

  • Auto-approve only “clearly allowable” cases (where criteria are satisfied and evidence is present).
  • Route anything ambiguous/complex to clinical review, without forcing reviewers to inherit a model’s recommended denial rationale.
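A minimal routing sketch under those assumptions; the criteria and field names are hypothetical, not a real payer configuration. The key property is that the only automated outcome is an approval, and anything else goes to a clinician with the open questions rather than a pre-written denial rationale.

```python
# Illustrative tiering: every required criterion must be explicitly satisfied
# (with evidence present in the record) for a request to qualify for auto-approval.
def route(facts: dict, required_criteria: dict) -> dict:
    unmet = [name for name, check in required_criteria.items() if not check(facts)]
    if not unmet:
        return {"action": "auto_approve", "unmet_criteria": []}
    # Ambiguous or complex: hand the reviewer the facts and the open questions,
    # not a model-generated denial rationale to rubber-stamp.
    return {"action": "clinical_review", "unmet_criteria": unmet}

criteria = {
    "conservative_therapy": lambda f: f.get("conservative_therapy_weeks", 0) >= 6,
    "no_recent_duplicate_imaging": lambda f: not f.get("prior_mri_within_12mo", False),
}

print(route({"conservative_therapy_weeks": 6, "prior_mri_within_12mo": False}, criteria))
print(route({"conservative_therapy_weeks": 2}, criteria))
```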

https://www.linkedin.com/pulse/prior-authorization-needs-speed-scale-ashish-jaiman-8au3e

Monitoring that includes policy drift and exception behavior

When policy changes, you update rules (with tests), not just retrain models. That matters because Stanford explicitly flags underperformance when coverage policies evolve faster than tools update. (You can’t govern what you can’t version.)
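Here is one illustrative way to express that (the version keys and parameters are made up): policy versions are data selected by effective date, and regression tests run against every version before it ships.

```python
# Each policy version is data: an effective date plus executable parameters.
# Updating policy means adding a version (with tests), not retraining a model.
POLICY_VERSIONS = {
    "2025-07-01": {"min_conservative_therapy_weeks": 4},
    "2026-02-01": {"min_conservative_therapy_weeks": 6},
}

def rule_for(date_of_service: str) -> tuple[str, dict]:
    # Pick the latest version effective on or before the date of service.
    version = max(v for v in POLICY_VERSIONS if v <= date_of_service)
    return version, POLICY_VERSIONS[version]

def meets_criteria(facts: dict, date_of_service: str) -> tuple[bool, str]:
    version, params = rule_for(date_of_service)
    ok = facts["conservative_therapy_weeks"] >= params["min_conservative_therapy_weeks"]
    return ok, version

# Regression tests: the same facts must keep producing the expected outcome under
# each versioned rule, so a policy change is explicit, testable, and reviewable.
assert meets_criteria({"conservative_therapy_weeks": 5}, "2025-08-15") == (True, "2025-07-01")
assert meets_criteria({"conservative_therapy_weeks": 5}, "2026-03-01") == (False, "2026-02-01")
print("policy version regression tests passed")
```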

https://www.linkedin.com/pulse/policy-executable-code-guided-ai-rules-knowledge-graphs-ashish-jaiman-yg4ie

Higher-quality submissions (provider-side lift)

Stanford calls out tools that improve documentation quality and completeness. Document-to-facts extraction + rule checks can guide providers toward “what’s missing” before submission, a direct lever to reduce friction and avoid preventable denials.
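As a sketch, with hypothetical rule IDs and field names, the same extraction-plus-rules machinery can run on the provider side before submission and return exactly what is missing:

```python
# Fields a given authorization rule needs before it can be evaluated at all.
# The rule ID and field list are hypothetical; real requirements come from payer policy.
REQUIRED_FIELDS = {
    "MRI-LUMBAR-001": [
        "cpt_code", "icd10_code", "conservative_therapy_weeks", "neuro_exam_findings",
    ],
}

def whats_missing(rule_id: str, extracted_facts: dict) -> list[str]:
    # Compare facts extracted from the chart against the rule's requirements and
    # tell the provider exactly what to add before submitting the request.
    return [f for f in REQUIRED_FIELDS[rule_id] if extracted_facts.get(f) in (None, "", [])]

facts_from_chart = {"cpt_code": "72148", "icd10_code": "M54.16"}
print(whats_missing("MRI-LUMBAR-001", facts_from_chart))
# -> ['conservative_therapy_weeks', 'neuro_exam_findings']
```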

https://www.linkedin.com/pulse/great-shift-left-how-neuro-symbolic-ai-turns-denial-ashish-jaiman-wrehe

Meaningful human review

If a reviewer evaluates an Evidence Pack (citations, rule IDs, traces) rather than an opaque score or an AI-written summary, the review becomes auditable and correctable—not ceremonial. Stanford’s “toothless human-in-the-loop” warning is exactly the failure mode that glass-box systems are built to avoid.

https://www.linkedin.com/pulse/trustworthy-payment-integrity-ai-speed-human-judgment-ashish-jaiman-t0ufe

Transparency and disclosure

If every decision ships with a replayable proof, disclosure becomes operationally feasible — and dispute resolution becomes faster because you’re debating policy + evidence, not arguing about a model’s confidence score.
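What “replayable” means in practice, as a toy illustration (the hashing scheme and rule are assumptions, not a spec): persist the facts, rule version, and outcome at decision time; at dispute time, anyone can re-run the same deterministic rule on the stored inputs and verify that the fingerprint matches.

```python
import hashlib, json

def fingerprint(record: dict) -> str:
    # Stable hash over the stored decision record.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def decide(facts: dict, rule_version: str) -> str:
    # Deterministic stand-in for the symbolic layer (threshold per rule version).
    threshold = {"2026-02-01": 6}[rule_version]
    return "approve" if facts["conservative_therapy_weeks"] >= threshold else "clinical_review"

# At decision time: persist the facts, rule version, outcome, and a fingerprint.
stored = {
    "facts": {"conservative_therapy_weeks": 6},
    "rule_version": "2026-02-01",
    "outcome": "approve",
}
stored_fp = fingerprint(stored)

# At dispute time: replay the decision from the stored inputs and verify the proof.
replayed = dict(stored, outcome=decide(stored["facts"], stored["rule_version"]))
assert fingerprint(replayed) == stored_fp, "decision does not replay"
print("decision replays:", replayed["outcome"])
```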

https://www.linkedin.com/pulse/probability-trap-ashish-jaiman-rqxae

“Explainable AI” Is an Operating System Requirement

The Stanford brief makes one thing clear: AI in coverage decisions must be designed to improve access, not to entrench incentives to deny or delay treatment.

That requires governance and architectures that produce deterministic, defensible decisions with audit-ready artifacts.

If you want the visual version of the framing I shared at HPRI (Prediction ≠ Proof; Neural Extract → Symbolic Reason → Evidence Pack), it’s here:

https://nedllabs.com/neuro-symbolic

Closing

The winners in this market won’t be the teams that deploy the most AI. They’ll be the teams that deploy AI with control: versioned logic, measurable safeguards, meaningful human review, and decision artifacts that stand up under scrutiny.

In health insurance, speed is not the goal.

Speed with proof is the goal.


About the author

Ashish Jaiman

Founder nēdl Labs | Building Intelligent Healthcare for Affordability & Trust | X-Microsoft, Product & Engineering Leadership | Generative & Responsible AI | Startup Founder Advisor | Published Author