Today's submission queue (synthetic)
| Case | Drug · Payer | Denial risk | Top driver | Recommended pre-flight action |
|---|
Risk scores are calibrated probabilities, not rankings — a 0.72 means 72 of 100 such cases historically bounced. Calibration is what lets you allocate finite reviewer time rationally instead of chasing a leaderboard.
Calibration (reliability curve)
Predicted denial probability (deciles) vs. observed denial rate. The senior framing: a sharp-but-miscalibrated model misallocates human review; a calibrated one converts directly into staffing math.
Model card (what production must handle)
- Decision first: the model exists to trigger an action — fix documentation pre-submission, or route to human review. Output is a work queue, not a score.
- Features: payer × plan × drug approval history, step-therapy documentation status (NLP-extracted from clinical notes), required-lab presence, prescriber historical approval rate, diagnosis–criteria match.
- Label hygiene: overturned-on-appeal denials are re-labeled; payer criteria are versioned because rules shift quarterly and stale labels poison the model.
- The feedback loop: if the model fixes submissions, observed denials fall and the training signal erodes — maintained holdout traffic and importance weighting are in the design, not bolted on.
- Clinical NLP core: the highest-value features come from unstructured notes — symptoms, prior medications, failed therapies. Extraction quality is the model's ceiling.
One-pager: the first-pass approval machine
Every denial costs an appeal cycle (days to weeks), reviewer labor, and patient risk — and most denials are administrative, not clinical: a missing lab, undocumented step therapy, a criteria mismatch the note actually satisfies but the form doesn't show. Prevention is strictly cheaper than appeal, and it moves the headline time-to-therapy metric directly.
- A triage queue ranked by calibrated denial probability, each case carrying its top risk driver and a concrete pre-flight action.
- The ops translation: flags cleared → projected first-pass lift → appeal-hours avoided → patient-days saved. Model quality expressed in the units the CEO runs the company on.
- A model card that takes the hard parts seriously: label hygiene under appeals, quarterly payer-rule drift, and the self-eroding feedback loop.
The score runs at submission-assembly time inside the existing workflow; flagged cases route to a fix-it queue with the missing artifact named. Holdout traffic preserves the training signal. The same NLP extraction layer that powers the features also powers the documentation-completeness pre-check — one investment, two products.
My ML research career started on exactly this substrate: extracting medications, symptoms, and side effects from scanned psychiatric notes at CAMH (Canada's largest mental-health hospital), then an MSc capstone on clinical NLP at the University of Toronto. Since then: privacy-safe ML data infrastructure for LLM training at Meta, and Head of AI at a digital-health startup building therapeutic conversational AI. Unstructured clinical text → reliable structured decisions is my home turf. — Jeff Pinto · jeff@jeffpinto.com · jeffpinto.com