METHODOLOGY

ML Deployment Checklist

A comprehensive pre-deployment checklist for machine learning models covering validation, monitoring, rollback, and governance requirements.

Last updated: April 4, 2026 · Version 1.0

Introduction

The gap between a working ML notebook and a production-ready model is wider than most teams expect. A model that performs well in offline evaluation still has to survive infrastructure, monitoring, rollback, and governance requirements, and that complexity stops 90% of models from shipping. The ML Deployment Checklist closes that gap by making operational readiness explicit and testable before a model reaches production.

The checklist is organized into six phases: model validation, data pipeline validation, infrastructure readiness, monitoring and alerting, rollback and recovery, and governance sign-off. Each phase has binary pass/fail criteria, and a model cannot proceed to the next phase until every criterion in the current phase is met.

Use the checklist as a pre-launch gate in your CI/CD pipeline, as a team review document, or as a formal sign-off process before production deployment. Xephyr uses this checklist in every ML engagement to ensure models are operationally ready before deployment reviews begin.

Phase 1: Model Validation

Phase 1 verifies that the model performs reliably before any production infrastructure work begins. Problems caught here are cheap to fix; the same problems discovered after deployment are expensive.

  • Held-out test set performance meets the minimum threshold defined at project start — not just validation set performance
  • Model performance tested on time-based splits, not just random splits — temporal leakage identified and eliminated
  • Edge cases and adversarial inputs tested — model behavior on out-of-distribution inputs documented
  • Bias and fairness evaluation completed for all relevant demographic dimensions
  • Model explainability verified — decision logic explainable to business stakeholders in non-technical terms
  • Baseline comparison documented — model outperforms the current decision process or a simple heuristic by a defined margin
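One criterion above calls for time-based splits rather than random splits. A minimal sketch of the idea in Python; the `as_of` and `label` field names are illustrative:

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Split records by timestamp rather than randomly, so the test set
    contains only data from after the training window ends. This is the
    split that surfaces temporal leakage a random split hides."""
    train = [r for r in rows if r["as_of"] < cutoff]
    test = [r for r in rows if r["as_of"] >= cutoff]
    return train, test

rows = [
    {"as_of": date(2025, 1, 5), "label": 0},
    {"as_of": date(2025, 2, 10), "label": 1},
    {"as_of": date(2025, 3, 20), "label": 1},
]
train, test = time_based_split(rows, cutoff=date(2025, 3, 1))
```

In a real project the cutoff should mirror the deployment situation: train on everything available up to a date, evaluate only on what arrives afterward.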

Phase 2: Data Pipeline Validation

Phase 2 verifies that the data feeding the model in production matches what the model saw in training, and that the pipeline is reliable enough to depend on in production.

1. Validate Feature Engineering Pipeline

Every feature transformation applied during training must be replicated exactly in the production pipeline; any divergence is training-serving skew, and it degrades predictions silently. Test for skew by replaying a sample of raw records through both pipelines and comparing the resulting feature values. A feature store prevents most skew by computing each feature once and serving the same values to both training and inference. If skew is detected after deployment, treat it as an incident: quantify which predictions were affected, fix or roll back the divergent transformation, and re-validate the pipeline before resuming normal traffic.
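One way to test for training-serving skew, sketched under the assumption that both pipelines can be run on the same raw records and that features are numeric; all names here are illustrative:

```python
import math

def feature_skew(train_row, serve_row, tol=1e-9):
    """Given the feature dict produced by the training pipeline and the one
    produced by the serving pipeline for the SAME raw record, return the
    names of features whose values diverge (training-serving skew)."""
    skewed = []
    for name, train_val in train_row.items():
        serve_val = serve_row.get(name)
        if serve_val is None or not math.isclose(
            train_val, serve_val, rel_tol=tol, abs_tol=tol
        ):
            skewed.append(name)
    return skewed

# The serving pipeline here "forgot" the log transform on income.
train_features = {"income_log": math.log(50_000.0), "age": 34.0}
serve_features = {"income_log": 50_000.0, "age": 34.0}
```

Running this comparison over a few thousand sampled records as a CI step catches most transformation drift before it reaches production.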

2. Test Data Freshness and SLAs

The model's performance depends on receiving fresh data on schedule. Define the maximum acceptable data latency for each feature, then test how the pipeline behaves when upstream sources are delayed or unavailable: does the model fail gracefully, or does it silently serve predictions computed from stale data?
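A minimal sketch of a freshness check against per-source SLAs; the source names and SLA values are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable data latency per upstream source (illustrative values).
FRESHNESS_SLA = {
    "transactions": timedelta(hours=1),
    "customer_profile": timedelta(hours=24),
}

def stale_sources(last_updated, now):
    """Return the upstream sources whose latest data is older than that
    source's SLA, so the caller can fail, fall back, or flag predictions
    as computed from stale inputs."""
    return sorted(
        src for src, ts in last_updated.items()
        if now - ts > FRESHNESS_SLA[src]
    )

now = datetime(2026, 4, 4, 12, 0, tzinfo=timezone.utc)
checked = stale_sources(
    {"transactions": now - timedelta(hours=3),
     "customer_profile": now - timedelta(hours=2)},
    now,
)
```

Run the same check in a test harness with deliberately delayed fixtures to confirm the pipeline takes the degradation path you intend.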

3. Validate Schema Handling

Production data schemas change. Verify that the pipeline handles all three cases: new columns are ignored rather than causing errors, missing columns fail loudly with context rather than silently producing null predictions, and type changes are coerced where safe and rejected where not.
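The three rules can be sketched as a small validator; the `EXPECTED` schema is illustrative:

```python
EXPECTED = {"income": float, "age": int}  # illustrative model input schema

def validate_record(record):
    """Apply the three schema rules: ignore unknown columns, fail loudly
    on missing ones, coerce only safe type changes (int -> float)."""
    missing = [col for col in EXPECTED if col not in record]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    clean = {}
    for col, expected_type in EXPECTED.items():
        value = record[col]
        if type(value) is expected_type:
            clean[col] = value
        elif expected_type is float and type(value) is int:
            clean[col] = float(value)  # safe widening coercion
        else:
            raise TypeError(
                f"{col}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return clean
```

Note that unknown columns are dropped without error, while a missing column raises with the column names in the message, giving the on-call engineer context instead of a stream of null predictions.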

Phase 3: Infrastructure Readiness

Phase 3 verifies that the serving infrastructure can handle production load, is observable, and integrates correctly with the systems that consume its predictions.

  • Load testing completed — model serving handles peak traffic at the latency defined in the SLA
  • Memory and CPU resource limits set appropriately — no unbounded memory growth under sustained load
  • Health check endpoint implemented and returning meaningful status (not just 200 OK regardless of state)
  • Deployment pipeline tested end-to-end in a staging environment with production-representative data volume
  • Model versioning implemented — old and new model versions can run simultaneously for A/B testing or shadow mode
  • Autoscaling configured and tested — the system scales up under load without manual intervention
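The "meaningful health check" criterion above can be sketched as a canary scorer; this assumes a model object exposing a `predict` method and is illustrative, not a prescribed interface:

```python
class HealthCheck:
    """A health check that verifies the model can actually score a known
    canary input, rather than returning 200 OK for any process state."""

    def __init__(self, model, canary_input, expected_range=(0.0, 1.0)):
        self.model = model
        self.canary_input = canary_input
        self.expected_range = expected_range

    def status(self):
        try:
            score = self.model.predict(self.canary_input)
        except Exception as exc:
            return {"healthy": False, "reason": f"predict raised: {exc}"}
        lo, hi = self.expected_range
        if score is None or not (lo <= score <= hi):
            return {"healthy": False, "reason": f"canary score {score} out of range"}
        return {"healthy": True, "reason": "canary scored normally"}

class DummyModel:  # stand-in for a loaded model artifact
    def predict(self, features):
        return 0.7

check = HealthCheck(DummyModel(), canary_input={"age": 34})
```

Wiring `check.status()` behind the serving framework's health endpoint means the load balancer stops routing to a replica whose model failed to load or began returning degenerate scores.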

Phase 4: Monitoring and Alerting

Phase 4 verifies that monitoring is sufficient to detect model degradation, data quality issues, and infrastructure failures before they affect business outcomes.

4. Set Up Prediction Monitoring

Define the prediction distribution you expect to see in production, then configure alerts for three conditions: distribution shift (prediction scores drifting outside the expected range), volume anomalies (significantly more or fewer predictions than expected), and null prediction rates (the model returning null or default values for an unexpected proportion of inputs).
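A minimal sketch of the three checks over one monitoring window; the baseline values and thresholds are illustrative and should be tuned per model:

```python
def prediction_alerts(scores, null_count, expected):
    """Check one window of prediction scores against the three alert
    conditions: distribution shift, volume anomaly, null rate."""
    alerts = []
    n = len(scores)
    mean = sum(scores) / n if n else 0.0
    if abs(mean - expected["mean"]) > expected["mean_tolerance"]:
        alerts.append("distribution_shift")
    if not (0.5 * expected["volume"] <= n <= 2.0 * expected["volume"]):
        alerts.append("volume_anomaly")
    if n + null_count and null_count / (n + null_count) > expected["max_null_rate"]:
        alerts.append("null_rate_exceeded")
    return alerts

expected = {"mean": 0.30, "mean_tolerance": 0.10,
            "volume": 1000, "max_null_rate": 0.01}
```

A mean-shift check is the simplest distribution signal; production systems usually track full histograms, but the alerting structure is the same.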

5. Configure Data Drift Detection

Input feature distributions change over time, and that change degrades model performance before any outcome labels arrive to confirm it. Store a reference distribution for each monitored feature, run a statistical drift test against the current production distribution on a schedule, and alert only when the drift statistic crosses a threshold that warrants action.
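A sketch of reference-distribution tracking using the Population Stability Index (PSI), a common drift statistic; the 0.2 alert threshold is a widely used rule of thumb, not a universal constant, and the example assumes a feature bounded to [0, 1):

```python
import math

def psi(reference, current, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index between a stored reference sample and
    the current production sample of one feature. Common rule of thumb:
    < 0.1 stable, 0.1-0.2 watch, > 0.2 actionable drift."""
    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        total = max(len(values), 1)
        return [c / total + eps for c in counts]  # eps avoids log(0)
    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [i / 100 for i in range(100)]                  # ~uniform on [0, 1)
drifted = [min(0.5 + i / 200, 0.999) for i in range(100)]  # mass in upper half
```

Kolmogorov-Smirnov or chi-squared tests are common alternatives; what matters for the checklist is that the reference sample is versioned alongside the model and the alert threshold is agreed on before launch.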

6. Wire Outcome Monitoring

Where outcomes are observable with acceptable latency, compare actual outcomes against predictions. Define how model performance will be measured in production, how frequently it will be evaluated, and what performance threshold triggers a retraining or rollback decision.
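The evaluation-and-decision step can be sketched as follows; accuracy stands in for whatever metric the project defines, and the thresholds are illustrative:

```python
def outcome_review(predicted, actual, rollback_floor=0.80, retrain_floor=0.90):
    """Compare predictions to observed outcomes for one evaluation window
    and return the accuracy plus the action the thresholds imply.
    Set the floors from the business case, not from these defaults."""
    n = len(actual)
    accuracy = sum(p == a for p, a in zip(predicted, actual)) / n if n else 0.0
    if accuracy < rollback_floor:
        return accuracy, "rollback"
    if accuracy < retrain_floor:
        return accuracy, "retrain"
    return accuracy, "hold"
```

Running this on a schedule (daily or per label-arrival batch) closes the loop between Phase 4 monitoring and the Phase 5 rollback criteria.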

Phase 5: Rollback and Recovery

Phase 5 verifies that the team can safely revert to the previous model version, or to a rule-based fallback, if the deployed model causes harm.

  • Previous model version still available and deployable in under 15 minutes
  • Rollback procedure documented, tested in staging, and known to the on-call engineer
  • Rule-based fallback defined for cases where all model versions fail — the business can operate without ML predictions
  • Downstream system behavior tested when the model is unavailable — graceful degradation verified
  • Rollback decision criteria defined — which specific signals trigger an immediate rollback versus a monitored investigation
  • Post-rollback review process defined — how the root cause will be identified and what will prevent recurrence
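The rollback-decision criterion above amounts to an agreed mapping from signals to actions, which is worth writing down as code so it cannot drift from the runbook; all signal names here are illustrative:

```python
# Illustrative mapping from monitoring signals to rollback decisions.
IMMEDIATE_ROLLBACK = {
    "error_rate_spike", "null_prediction_surge", "confirmed_harmful_output",
}
MONITORED_INVESTIGATION = {
    "gradual_drift", "latency_regression", "volume_anomaly",
}

def rollback_action(active_signals):
    """Decide between immediate rollback and a monitored investigation.
    Any immediate signal wins, regardless of what else is firing."""
    signals = set(active_signals)
    if signals & IMMEDIATE_ROLLBACK:
        return "rollback_now"
    if signals & MONITORED_INVESTIGATION:
        return "investigate"
    return "no_action"
```

Keeping this mapping in version control next to the alerting config means the on-call engineer executes a pre-agreed decision rather than debating severity at 3 a.m.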

Phase 6: Governance and Sign-Off

Phase 6 covers the human review and approval process before go-live. It ensures that the right stakeholders have reviewed the model and formally accepted its risk profile.

7. Business Stakeholder Review

The model owner and primary business stakeholders must review and sign off on four items: model behavior on business-representative test cases, edge case handling, known limitations and failure modes, and the monitoring and escalation procedures that will be in place at launch.

8. Legal and Compliance Review

For models affecting regulated decisions (credit, hiring, insurance, healthcare), legal and compliance review is mandatory before launch. Prepare the required documentation up front: a model card describing training data, intended use, and known limitations, plus the bias and fairness evaluation report from Phase 1. Expect legal to ask what data the model uses, whether protected attributes or their proxies influence decisions, how an adverse decision can be explained to the person it affects, and how decisions can be appealed.

Maturity Levels

Level 1: Ad Hoc

Models deployed manually via Jupyter notebook or direct file copy. No version control, no monitoring, no rollback capability. Every deployment is a heroic individual effort.

Level 2: Basic CI/CD

Models deployed via a basic CI/CD pipeline with automated testing. Version tracking in place. Basic endpoint health checks. No data drift monitoring or automated retraining.

Level 3: Production-Ready

The full pre-deployment checklist enforced as a CI gate. Prediction monitoring and data drift detection active. A documented rollback procedure, tested in staging. Business sign-off required before launch.

Level 4: MLOps Mature

Automated retraining triggered by drift detection or performance degradation. Shadow deployment for new models before traffic cutover. An A/B testing framework for model comparison. Outcome monitoring with feedback loops.

Level 5: Self-Healing

Models automatically retrain, evaluate, and promote themselves based on performance signals. Human review is reserved for significant behavioral changes or high-risk decisions. The ML platform team manages infrastructure; feature teams own model quality.
