
Trustworthy Payment Integrity = AI Speed + Human Judgment

Sep 24, 2025

Why? Because payment integrity (PI) is a socio-technical problem. The provider-payer claims and payment workflow sits at the intersection of clinical nuance, contractual detail, regulatory rules, and human judgment.

Models can identify and rank issues, but only experts can determine why a payment is wrong, how to correct it, and what changes are needed upstream to prevent recurrence.

The organizations making real gains are designing AI + HI (AI with Human Intelligence, experts in the loop) on purpose.

The stakes: leakage is persistent, and oversight is rising

Improper payments remain material. CMS’s CERT data put the 2024 Medicare FFS improper payment rate at 7.66% ($31.7B), a flat trend line that underscores how complex and challenging this problem is, even with decades of edits and audits.

Across the federal government, GAO reports $162B in estimated improper payments in FY2024, with the lion’s share as overpayments, keeping payment integrity in the policy spotlight.

At the same time, appeals data show a trust gap in utilization management. In Medicare Advantage, only ~10–12% of denials are appealed, yet over 80% of appealed decisions are overturned, a signal that first decisions often lack sufficient context or explanation.

If we want audits providers accept, and denials that don't boomerang, our systems must explain themselves and withstand scrutiny.

Why “AI-only” doesn't work in PI

  1. Ambiguity is the rule, not the exception. Claims reflect messy, real-world care. Policy text is nuanced; contracts are idiosyncratic. Scores without clinical and contractual rationales will spike friction and appeals.
  2. Regulatory gravity. Healthcare is rapidly codifying expectations for transparency and oversight in predictive systems (ONC’s HTI-1 algorithm transparency provisions; FDA’s transparency principles for ML-enabled devices; NIST’s AI Risk Management Framework). If you can’t show intended use, data lineage, risks, controls, and governance, your AI won’t clear committees.
  3. Provider abrasion is real. Industry surveys and trade reporting repeatedly flag abrasion as a byproduct of blunt PI interventions—and a barrier to long-term collaboration. The pattern: low-context findings → high overturn → lower trust.
  4. Clinical trust is earned, not inferred. Reviews in the medical literature show that decision support improves quality when clinicians can understand, critique, and override recommendations. In other words, explainable and contestable beats inscrutable.

What works: AI + HI by design

Think of AI + HI as co-reasoning: models surface patterns; humans validate, enrich, and operationalize. Done right, you get four compounding advantages:

  • Higher precision (fewer false positives) because expert review trains models on edge cases.
  • Lower abrasion because every decision ships with provenance (policy clause, contract term, edit table, effective date).
  • Faster prevention because validated findings lead to pre-pay rules you can defend.
  • Better governance because human checkpoints and audit trails align with regulators’ expectations for transparent, controllable AI.

A practical blueprint for PI leaders

Explainability as Architecture, Not Afterthought

Every model output should include the reason for the decision: policy citation, contract clause, code edit, effective dates, and evidence trails. This is not paperwork; it is the substrate of trust and the most effective antidote to abrasion. It also answers the regulator's core question: how does it work, and when is it appropriate?
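As a concrete illustration, a finding can carry its provenance as structured fields rather than free text. A minimal sketch follows; the field names and types are hypothetical, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Provenance:
    """Why the model flagged this claim: the evidence trail a reviewer or provider can check."""
    policy_citation: str   # medical policy section and version (illustrative)
    contract_clause: str   # payer-provider contract term relied on
    code_edit: str         # NCCI/MUE or custom edit identifier
    effective_date: str    # ISO date the rule took effect
    evidence: List[str] = field(default_factory=list)  # claim lines, notes, prior claims

@dataclass
class Finding:
    claim_id: str
    model_score: float        # model confidence, 0-1
    estimated_impact: float   # dollars at risk
    rationale: Provenance     # no finding ships without it
```

The design point is simply that the rationale is a required field of the finding, not an attachment added later.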

Expert Eyes on High-Impact Calls

At a minimum, these three steps require human review and expert guidance:

  • Triage review for high-impact flags (e.g., DRG, high-dollar, frequent outliers).
  • Clinical/contractual adjudication for contested cases.
  • Promotion council that converts post-pay evidence into pre-pay rules with version control.

This “graduated oversight” keeps throughput high while ensuring hard calls get expert eyes. Reviews in informatics and bioethics literature argue explicitly for this calibrated human role.
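One way to encode this graduated oversight is a routing rule that sends high-impact flags to human queues and lets low-risk ones flow automatically. The thresholds and queue names below are placeholders, not recommendations.

```python
def route_flag(drg_related: bool, dollars_at_risk: float, model_confidence: float) -> str:
    """Route a model flag to a review path (illustrative thresholds only)."""
    if drg_related or dollars_at_risk >= 25_000:
        return "clinical_adjudication"   # expert clinical/contractual review
    if model_confidence < 0.7:
        return "triage_review"           # human triage before any provider contact
    return "auto_queue"                  # low-impact, high-confidence: standard workflow

# Example: a high-dollar flag gets expert eyes even when model confidence is strong.
print(route_flag(drg_related=False, dollars_at_risk=40_000, model_confidence=0.92))
# -> clinical_adjudication
```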

Continuous Learning or Continuous Drift

Ship continuous learning loops: capture reviewer dispositions (“agree,” “agree with changes,” “reject”), capture why, and feed back into model retraining. Retire weak features; elevate strong ones. Publish drift dashboards so everyone—medical policy, SIU, finance—sees whether precision is improving.
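A minimal sketch of the feedback capture, assuming dispositions are logged per finding; the monthly precision roll-up below is the kind of number a drift dashboard would plot.

```python
from collections import defaultdict

# Each record: (month, disposition), captured at review time along with the reason.
reviews = [
    ("2025-07", "agree"), ("2025-07", "reject"), ("2025-07", "agree_with_changes"),
    ("2025-08", "agree"), ("2025-08", "agree"), ("2025-08", "reject"),
]

def precision_by_month(records):
    """Share of flags reviewers accepted (with or without changes), per month."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for month, disposition in records:
        totals[month] += 1
        if disposition in ("agree", "agree_with_changes"):
            accepted[month] += 1
    return {m: accepted[m] / totals[m] for m in totals}

print(precision_by_month(reviews))  # falling values signal drift worth investigating
```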

Provider-Ready, Audit-Ready, Always

For each finding, auto-assemble a packet: claim lines involved; rules and references; medical necessity logic where applicable; timeline of updates (policy version, NCCI/MUE table version, contract effective dates). Plans that communicate clearly see better cooperation and fewer escalations. (Multiple industry sources highlight education and transparency as levers to reduce abrasion.)
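The packet can be assembled mechanically from the finding and its provenance. The structure below is a hypothetical sketch of what "provider-ready" might contain; the keys are assumptions, not a defined format.

```python
def build_packet(finding, claim_lines, policy_version, edit_table_version, contract_effective):
    """Assemble a provider- and audit-ready evidence packet for one finding (illustrative fields)."""
    return {
        "claim_id": finding["claim_id"],
        "claim_lines": claim_lines,                              # the specific lines in question
        "rules_and_references": finding["citations"],            # policy clauses, contract terms, edits
        "medical_necessity_logic": finding.get("mn_rationale"),  # where applicable
        "version_timeline": {
            "policy_version": policy_version,
            "ncci_mue_table_version": edit_table_version,
            "contract_effective_date": contract_effective,
        },
    }
```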

Secure by Default, Compliant by Design

Map NIST AI RMF functions (Govern, Map, Measure, Manage) to your PI workflow: risk registers for models, bias testing, role-based overrides, and incident response when performance drifts. Align with HHS/ONC transparency requirements and FDA transparency principles to keep cross-functional teams (security, compliance, medical policy) on the same page.
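The mapping can be as concrete as a per-model risk register entry keyed to the four RMF functions. The entry below is an assumption about what such a record might hold, not an official template; the model name is made up.

```python
risk_register_entry = {
    "model": "postpay_dup_detector_v3",  # hypothetical model name
    "govern": {"owner": "medical_policy", "review_cadence_days": 90},
    "map": {"intended_use": "rank duplicate-billing flags for human review",
            "out_of_scope": "autonomous denials"},
    "measure": {"bias_tests": ["provider_type", "geography"],
                "drift_metric": "monthly_precision"},
    "manage": {"override_roles": ["clinical_reviewer", "siu_lead"],
               "incident_response": "rollback_to_v2_and_notify_compliance"},
}
```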

From Recovery to Prevention

If your post-pay program isn't systematically promoting validated issues to pre-pay controls, backed by provenance, you're paying for the same error twice. A disciplined AI + HI loop should (see the sketch after the list):

  1. Detect (AI flags ranked by impact × confidence).
  2. Validate (expert review with policy/contract rationale).
  3. Packetize (provider-ready evidence).
  4. Promote (governed pre-pay edit with effective dates and rollout plan).
  5. Monitor (track acceptance, overturn, and yield; roll back rapidly if signal degrades).
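A minimal sketch of the detect-and-rank step and the promotion gate, with made-up thresholds; everything between them (validation, packets) is the human-in-the-loop work described above.

```python
def rank_flags(flags):
    """Step 1: order flags by expected value = estimated impact x model confidence."""
    return sorted(flags, key=lambda f: f["impact"] * f["confidence"], reverse=True)

def eligible_for_prepay(issue):
    """Step 4 gate (illustrative): promote only validated issues with enough accepted evidence."""
    return (issue["validated"]
            and issue["provider_acceptance_rate"] >= 0.85
            and issue["occurrences"] >= 20)

flags = [
    {"id": "F1", "impact": 12_000, "confidence": 0.9},
    {"id": "F2", "impact": 50_000, "confidence": 0.4},
]
print([f["id"] for f in rank_flags(flags)])  # F2 first: higher expected value despite lower confidence
```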

Tie incentives to reducing repeat variance, not just recovering dollars. RAC programs and appeals history show that raw detection without context doesn't travel well, and that defensibility matters.

Measure What Matters

Beyond standard pipeline and recovery metrics, measure the following (a computation sketch follows the list):

  • Acceptance rate (provider acceptance without appeal) and overturn rate (by reason).
  • Packet completeness score (did we include the correct citations, effective-date diffs, and reasoning?).
  • Pre-pay promotion rate (share of validated post-pay issues that become durable pre-pay controls).
  • Time-to-value (flag → validated → dollars realized).
  • Abrasion indicators (provider call volume, repeat dispute categories).
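Several of these are straightforward to compute once dispositions and timestamps are logged. The sketch below assumes a simple per-finding record and is illustrative only.

```python
from datetime import date

findings = [
    {"accepted_without_appeal": True,  "promoted_to_prepay": True,
     "flagged": date(2025, 6, 1),  "dollars_realized": date(2025, 7, 15)},
    {"accepted_without_appeal": False, "promoted_to_prepay": False,
     "flagged": date(2025, 6, 10), "dollars_realized": None},
]

acceptance_rate = sum(f["accepted_without_appeal"] for f in findings) / len(findings)
promotion_rate  = sum(f["promoted_to_prepay"] for f in findings) / len(findings)
time_to_value   = [(f["dollars_realized"] - f["flagged"]).days
                   for f in findings if f["dollars_realized"]]

print(acceptance_rate, promotion_rate, time_to_value)  # 0.5 0.5 [44]
```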

On utilization management adjacencies, monitor appeal success closely. When >80% of appealed MA denials are overturned, the signal is unmistakable: explanations—and upstream rules—must improve.

Responsible AI

Regulators aren’t prescribing model architecture; they’re asking for control: clarity on intended use, transparency on how outputs are produced, and guardrails for monitoring and change.

The NIST AI RMF provides a widely adopted approach to operationalize this (governance, mapping risk, measuring performance, managing change). ONC’s HTI-1 adds algorithm transparency expectations to certified health IT, and the FDA’s transparency principles emphasize explainability and performance monitoring for safety. Build these into your PI platform from day one; don’t bolt them on.

There’s a deeper reason to invest here: clinicians trust tools that augment judgment and make reasoning visible.

AI that complements clinical cognition rather than trying to replace it is precisely the mindset PI needs.

Nedl Labs delivers AI-native payment integrity: provenance on every recommendation, expert checkpoints, provider-ready packets, and a drift ledger that promotes post-pay learnings to pre-pay controls. Our workflows integrate with your systems and can be intercepted for human review wherever judgment matters.


About the author

Ashish Jaiman

Founder Nedl Labs | Building Intelligent Healthcare for Affordability & Trust | Ex-Microsoft, Product & Engineering Leadership | Generative & Responsible AI | Startup Founder Advisor | Published Author