
ML in production: the gap between a trained model and a thing customers depend on.

Training a good model is one thing; running it as a system real customers depend on is another, and the gap is mostly the unglamorous engineering that keeps it shipped — idempotency, queue discipline, observability, schema evolution, and deploys that don't drop users.

23 April 2026 · by Bogdan · #ml #production #engineering

There's a phrase that gets used loosely: "the model is in production."

In a notebook on a researcher's laptop, the model returns a prediction when called, and the prediction looks reasonable, and the F1 is good. In a serving framework, the model returns a prediction when called over HTTP, and the latency is fine when one person hits it, and the team is pleased. Both of these are sometimes called "in production." Neither of them is.

A model is in production when its outputs change what the operation does. When a customer's invoice changes because of it. When a trade gets sized differently because of it. When a technician gets dispatched or not because of it. When a piece of content gets shown or hidden because of it. When the model goes down or goes wrong, something measurable changes for someone who didn't ask for it to change.

Every additional thing that's true about a system in that sense — multiple users, real load, retries, deploys, schema drift, on-call — adds engineering to the model's surroundings. By the time the model is running for a real operation, the engineering surroundings are most of the work, and the model itself is a relatively small part of the software you ship. Almost all of the failure modes that make ML in production hard are not modeling failures. They are systems-engineering failures with a model sitting in the middle of them.

This is the unfashionable thing to say in a moment when shipping something is easier than ever. Code generation has compressed the time from idea to "it runs." It has not compressed the time from "it runs" to "it stays running for the customers who are paying for it." The asymmetry between those two milestones is roughly where Oriented Platforms operates.

What follows is a partial list of the surroundings. None of it is the model.

Async correctness

Most production ML inference is not synchronous request-response. It is a job that gets enqueued, processed, and whose result gets delivered somewhere else — to a database, to another service, to a downstream model in a chain, to an email. The transition from "the model returns a prediction" to "the prediction reaches the place where it matters" is its own engineering surface, and most of it has nothing to do with the model.

The first question that matters: where does the response go when the model takes thirty seconds, and the user has already navigated away? The second: what happens when two requests for the same input arrive within a second of each other — does the model run twice? The third: if the downstream system is briefly down, does the prediction get dropped, queued, or retried indefinitely?

These are the questions that get hand-waved in a notebook and that determine whether the system is correct in practice. The default behavior of "call function, get result" is wrong for a system that has to survive partial failures, and replacing it with the right behavior is half the engineering work of getting an ML system into production.
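The third question (dropped, queued, or retried indefinitely?) can be made concrete with a small sketch. It assumes a hypothetical `send` callable that raises `ConnectionError` when the downstream system is unavailable, and a plain list standing in for a dead-letter queue; the names, the retry count, and the backoff values are illustrative, not a recommendation.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DeliveryResult:
    delivered: bool
    attempts: int


def deliver_with_retry(
    send: Callable[[dict], None],      # hypothetical downstream call
    prediction: dict,
    dead_letters: list,                # stand-in for a real dead-letter queue
    max_attempts: int = 4,             # illustrative value
    base_delay_s: float = 0.5,         # illustrative value
    sleep: Callable[[float], None] = time.sleep,
) -> DeliveryResult:
    """Deliver a prediction with bounded retries; park it if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(prediction)
            return DeliveryResult(delivered=True, attempts=attempt)
        except ConnectionError:
            # Exponential backoff between attempts, but never retry forever.
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))
    # Not dropped and not retried indefinitely: kept for later replay.
    dead_letters.append(prediction)
    return DeliveryResult(delivered=False, attempts=max_attempts)
```

The point of the sketch is that the answer is explicit: a prediction is either delivered or parked where someone can replay it. Because `send` may run more than once for the same prediction, the receiving side has to tolerate duplicates.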

Idempotency

Networks fail. Workers crash. Deploys interrupt jobs mid-processing. Every distributed system has retries somewhere — at the load balancer, at the queue, at the client, at the orchestrator. If retries reach a handler that isn't idempotent, the side effects compound. Predictions get logged twice. Customer-visible counters get incremented twice. Bills get charged twice.

The fix is well-understood and almost never the first thing built. Every inference job carries an idempotency key — usually a hash of the inputs plus a request ID — and every side effect is gated on the key not having been seen before. The implementation is some combination of a dedup table, a SELECT ... FOR UPDATE, an upsert, or a message-broker feature. The shape varies; the discipline does not.
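One possible shape of the dedup-table variant, as a minimal sketch using Python's built-in `sqlite3`. The table names, the `charge_once` function, and the key format are assumptions made for illustration; a real system would use its own database and its own side effect.

```python
import hashlib
import json
import sqlite3


def idempotency_key(request_id: str, inputs: dict) -> str:
    """Hash of the request ID plus canonicalized inputs (key order does not matter)."""
    payload = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{request_id}:{payload}".encode("utf-8")).hexdigest()


def make_store() -> sqlite3.Connection:
    # In-memory database for the example; both tables are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen_keys (key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (key TEXT, amount_cents INTEGER)")
    return conn


def charge_once(conn: sqlite3.Connection, key: str, amount_cents: int) -> bool:
    """Apply the side effect only if this key has not been seen before."""
    with conn:  # one transaction: commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_keys (key) VALUES (?)", (key,)
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: skip the side effect
        conn.execute(
            "INSERT INTO charges (key, amount_cents) VALUES (?, ?)",
            (key, amount_cents),
        )
        return True
```

Recording the key and applying the side effect in the same transaction is the design choice that matters here: a crash between the two cannot leave a key marked as seen with no charge behind it, or a charge with no key.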

Idempotency is the thing you don't notice when it's working and that destroys customer trust the first time it isn't. "Why did I get billed three times for one prediction" is unforgivable in a way that "the prediction was slightly wrong" is not.

Queue saturation under burst

Arrival rate is never constant. The number of requests per second the system has to handle right now is not the average; it is the burst. Black Friday hits a retail-adjacent system. A backfill of historical data hits an analytics pipeline. A viral moment hits a recommendation system. A market move hits a trading-adjacent system.

If the queue grows faster than the workers drain it, latency grows without bound. The first symptom is "requests are slower than usual." The second is "alerts are firing for stuck jobs." The third is "the system is functionally down." From the outside all three look the same (the system is unresponsive), and the cause is invisible without queue-depth metrics.

The defenses are unglamorous and well-known: backpressure (refuse new work when the queue is too full), autoscaling (add workers under load), load shedding (drop low-priority work to keep the high-priority work flowing), priority queues (hot tenants vs. backfills), and circuit breakers (stop calling downstream services that are themselves saturated). The design choices are operation-specific. The need for them is not.
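Backpressure is the simplest of these to sketch. The example below uses the standard library's bounded `queue.Queue`; the `BoundedInbox` class and its method names are hypothetical, and the maximum depth is something each operation has to choose for itself.

```python
import queue


class BoundedInbox:
    """Refuses new work once the queue is full instead of growing without limit."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)
        self.rejected = 0  # worth exporting as a metric next to depth

    def submit(self, job) -> bool:
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            # Backpressure: tell the caller now rather than time out later.
            self.rejected += 1
            return False

    def take(self):
        # Raises queue.Empty when there is nothing to process.
        return self._q.get_nowait()

    def depth(self) -> int:
        return self._q.qsize()
```

A caller that gets `False` back can retry later, degrade, or surface an error. Any of those is usually better than a job that sits in an unbounded queue until nobody is waiting for its answer.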

Observability

You cannot fix what you cannot see. ML systems in production have at least three observability layers, and most teams build only the first one before something else forces them to build the others.

System performance. Latency, error rate, queue depth, worker utilization, retry counts. Standard SRE metrics with the model as a black box. This is what tells you the system is up.
Model performance. Are the predictions actually right? You almost never have ground truth at inference time, but you usually have it a day or a week later — when the click happens, when the trade settles, when the technician confirms. The pipeline that pulls in delayed labels and computes accuracy/AUC/whatever-matters over a rolling window is the thing that catches a model that has quietly gotten worse.
Data drift. The input distribution moves. The model was trained on the world as it was; the world isn't that anymore. Drift detection is unsexy statistics (KL divergence, PSI, a feature-by-feature distribution check) running continuously and comparing live inputs against the training distribution. When drift trips a threshold, nobody on call gets paged; the model team gets a ticket.

Without all three, the failure mode is invariably the same: a customer notices the model is wrong before the team does. That's a credibility hit it takes months to recover from.
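For the drift layer, a population stability index (PSI) check on a single numeric feature might look like the sketch below. The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions made to keep the example short; the alert threshold is left to the team.

```python
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],   # feature values from the training data
    actual: Sequence[float],     # the same feature from recent live traffic
    bins: int = 10,
    eps: float = 1e-6,           # floor so empty bins do not produce log(0)
) -> float:
    """Population stability index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant training feature: everything lands in one bin

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Live values outside the training range go to the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    # Sum over bins of (actual - expected) * ln(actual / expected).
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples score zero, and the score grows as the live distribution moves away from the training one. Run per feature on a schedule, a check like this is what opens the ticket for the model team.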

Schema evolution

The model was trained on the input shape that existed at training time. The input shape is going to change. A new column gets added upstream. A previously-required field becomes optional. An enum gets a new value the model has never seen. The product team renames a feature.

If the model receives an unexpected shape and silently coerces it — fills with zero, drops the field, NaN-propagates — the predictions degrade in a way that is invisible until someone goes looking. "The model is performing worse" is a vague enough complaint that it can survive a quarter without a root cause.

The defenses again: schema validation at the inference boundary (reject inputs that don't match the contract), versioned schemas (the model declares which schema it expects, and the contract is enforced), and a shadow path for the new schema (run new inputs through the new schema's model in shadow mode, compare results, switch over only when the comparison is good).

A model trained on schema v3 should not silently accept schema v4 inputs. It should refuse, or be replaced.
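A minimal sketch of that refusal at the inference boundary, in plain Python. The `InputSchema` class, the `schema_version` field, and the example fields are all hypothetical; a production system would more likely lean on a schema library or a registry, which this sketch does not try to imitate.

```python
from dataclasses import dataclass


class SchemaError(ValueError):
    """Raised when an input does not match the contract the model was trained on."""


@dataclass
class InputSchema:
    version: int
    required: dict  # field name -> expected Python type
    enums: dict     # field name -> set of values seen at training time

    def validate(self, payload: dict) -> dict:
        if payload.get("schema_version") != self.version:
            raise SchemaError(
                f"expected schema v{self.version}, "
                f"got {payload.get('schema_version')!r}"
            )
        fields = {k: v for k, v in payload.items() if k != "schema_version"}
        missing = set(self.required) - set(fields)
        unknown = set(fields) - set(self.required)
        if missing or unknown:
            # No silent fill-with-zero and no silent drop: refuse instead.
            raise SchemaError(
                f"missing={sorted(missing)} unknown={sorted(unknown)}"
            )
        for name, expected_type in self.required.items():
            if not isinstance(fields[name], expected_type):
                raise SchemaError(f"{name}: expected {expected_type.__name__}")
        for name, allowed in self.enums.items():
            if fields[name] not in allowed:
                raise SchemaError(f"{name}: unseen value {fields[name]!r}")
        return fields
```

Rejected inputs are loud by construction. They show up as errors with a reason attached, not as a slow decline in accuracy that takes a quarter to diagnose.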

Deploys without dropping users

A model in production is a binary artifact — weights, sometimes a tokenizer or feature transformer, a few config parameters — but the deploy is more than file replacement. There are in-flight requests on the old model when the new one rolls out. There are queued jobs that were enqueued under v1 and will be processed under v2 if you're not careful. There is a window where v1 and v2 are both serving, and they need to either disagree gracefully or not at all.

The shapes that work:

Shadow deployment. v2 runs alongside v1 for a period; both produce predictions, only v1's are used, the comparison gets logged. After a confidence period, switch.
Canary. v2 takes a small fraction of traffic, the model performance metrics are watched closely, the fraction grows over hours or days, then 100%.
Versioned predictions. Every stored prediction carries the model version that produced it. When something looks weird in retrospect, you can attribute it to the right model.
Rollback paths. The previous model is one config flip away. "Rolling back is a deploy" is the failure mode you want to avoid.

The design decisions vary by operation. The need to design them deliberately does not. "Push to prod" is not a deployment strategy for a model.
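Two of those shapes, canary routing and versioned predictions, fit in a short sketch. It assumes models are plain callables stored in a dict keyed by version and that each request carries a stable `request_key`; both are illustrative choices, as are the function names.

```python
import hashlib
from typing import Callable, Dict


def pick_version(
    request_key: str,
    canary_fraction: float,   # 0.0 = all stable, 1.0 = all canary
    stable: str = "v1",
    canary: str = "v2",
) -> str:
    """Deterministically route a stable share of keys to the canary."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return canary if bucket < canary_fraction else stable


def predict_versioned(
    models: Dict[str, Callable[[dict], float]],  # hypothetical model registry
    request_key: str,
    features: dict,
    canary_fraction: float,
) -> dict:
    version = pick_version(request_key, canary_fraction)
    return {
        "request_key": request_key,
        "model_version": version,  # stored with the prediction for attribution
        "prediction": models[version](features),
    }
```

Hashing the key, rather than rolling a random number per request, keeps a given key on the same model across retries. It also makes rollback a config change: setting `canary_fraction` to 0.0 sends everything back to the stable version without a deploy.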

What "shipped and stays shipped" looks like

A model that's shipped and stays shipped has all of the above wired up before customers depend on it. It is not the model with the highest leaderboard score. It is the model with the most boring failure modes — the ones you've already prepared for.

The gap between "trained a model" and "shipped a model that customers depend on" is the gap between a notebook and a system. The notebook part is usually the more enjoyable work. The system part is the work that determines whether the project ships value to anyone other than the team that built it.

When clients ask us why an ML project is taking longer than they expected, the answer is usually some version of this: the modeling was the cheap part, and the work to get it shipped is the expensive part, and the work to keep it shipped is most of what we're doing here. That ratio is not a sign of a bad project. It is a sign of a serious one.

The model is the smallest part. The discipline around it is most of what "in production" means.