Connect → clean → model: data orchestration patterns we ship with.
Idempotent handlers, at-least-once delivery, dead-letter routing, back-pressure, schema evolution, and pipeline observability — the recurring patterns that show up in every data pipeline we ship into production. The orchestrator is downstream of these.
Almost every data system we ship has the same shape underneath: connect (pull from somewhere), clean (validate, normalize, enrich), model (do the analytical or ML work), publish (write to wherever the result is consumed). Each arrow between those stages is a place where the system meets the network — and meets all the failure modes that come with it.
The patterns below recur at every arrow. They are not the orchestrator. The orchestrator is what you point at the patterns to schedule them. Picking the right orchestrator is a real decision, but it's downstream of getting these right. Pick the wrong patterns and the orchestrator's retry feature turns into a duplication engine. Pick the right patterns and most orchestrators are interchangeable.
This post is the conceptual half. The companion piece compares the orchestrators themselves at production scale.
Idempotent handlers
Re-running the same task with the same input produces the same effect on the world. That is what idempotent means here. It is the single most load-bearing property of a production data pipeline.
The reason is that every pipeline retries. The retry might be in the orchestrator, in the queue, in the worker, in the network library, in the human operator manually clicking "rerun." You cannot prevent retries; you can only design for them.
What idempotency looks like in practice depends on the side effect:
- Database writes. Use upserts (
INSERT ... ON CONFLICT DO UPDATE,MERGE, etc.) keyed on a deterministic ID derived from the input. Re-running the task overwrites the same row rather than inserting a duplicate. Where upserts aren't available, use a dedup table — record(idempotency_key, completed_at)and check before writing. - Object storage writes. Content-addressed paths. The output filename is
{deterministic_hash}.parquet, not{timestamp}-{random}.parquet. Re-running writes to the same path; downstream consumers don't see a duplicate. - External API calls. An idempotency key passed in the request header (the modern API standard). The provider dedups on their side. Stripe, Square, and most modern payment APIs do this; non-payment APIs increasingly do too.
- Message publishing. A deterministic message ID, plus a consumer that dedups on the ID. (See at-least-once below.)
The discipline is to think about what the side effect is before writing the handler. If you can't articulate what the side effect is, you can't make it idempotent.
The cost of getting this wrong is invisible during development and devastating in production. Duplicate financial entries. Double-counted analytics rows. Two emails sent to the same customer. Each one is the kind of bug that survives a quarter before someone notices, and then everyone notices at once.
At-least-once delivery, and what it implies
The default delivery semantics of every modern message queue — Kafka, RabbitMQ, SQS, NATS, Pub/Sub — is at-least-once. The queue guarantees that every message will be delivered to a consumer at least one time. It does not guarantee that every message will be delivered at most one time. Duplicates are part of the contract.
Exactly-once delivery exists in some queues as a feature, but it is expensive (extra round-trips, broker-side state) and brittle in failure modes that span beyond the queue (a broker can be exactly-once internally and the consumer can still process the message twice if it crashes after acking). The conventional wisdom — and the one we ship with — is to assume at-least-once and combine it with idempotent consumers. The result is effectively exactly-once at the system level, without paying the broker-side cost.
The pattern:
- The producer writes a message with a deterministic message ID (derived from the source event, not from the producer's clock).
- The consumer dedups on message ID — a small dedup table or, for high-throughput systems, a Bloom filter backed by a periodic compacted log.
- Side effects within the consumer are idempotent (per the previous section), so even if dedup misses for some reason, the system stays correct.
The two layers of defense are not redundant. Dedup catches the common case efficiently. Idempotency catches the case dedup misses (a stale dedup table, a failover that loses dedup state). Together they get you a system that survives broker hiccups, consumer crashes, and operator reruns without corrupting state.
Dead-letter routing
Sooner or later, a message fails to process and continues to fail on every retry. The data is malformed. The downstream is permanently down. The schema is incompatible. Whatever the reason, the message cannot succeed in its current form, and you have two bad options if you do nothing: it blocks the queue forever, or you start dropping it silently.
Dead-letter routing is the third option. After N failed attempts, the message is moved to a dead-letter queue (DLQ) with full context — the original payload, the error, the stack trace, the attempt count, the timestamp. The main queue keeps flowing. The DLQ is monitored, and either a human or an automated remediation system handles the failures.
Operational discipline:
- Alarm on DLQ depth. Zero is the steady state; non-zero is a problem.
- Triaging the DLQ is a real engineering task, not a low-priority chore. The patterns in the DLQ tell you something is broken.
- Reprocessing from the DLQ is a first-class operation. Once the underlying issue is fixed (a producer schema change rolled back, a downstream service restored), DLQ messages get replayed — through the same idempotent handlers, so reprocessing is safe.
Without a DLQ, you choose between a stuck queue and silent data loss. Both are operational disasters. With one, you get a triageable signal that something specific is wrong.
Back-pressure and circuit breakers
When downstream is slower than upstream — even briefly — the queue between them grows. If the queue grows faster than it drains, end-to-end latency goes to infinity, then alerts fire, then the system is functionally down. The cause is rarely visible without queue-depth instrumentation.
Two patterns prevent the cascade:
- Back-pressure. The queue has a bound. When it fills, producers either block (slowing themselves down) or shed (refusing low-priority work). The system as a whole degrades smoothly under overload rather than catastrophically. The right bound is one you can hit during normal bursts and recover from quickly — not infinity.
- Circuit breakers. When downstream errors exceed a threshold, the consumer stops calling downstream entirely. Requests fail fast — usually returning a graceful default or an explicit "try later" response — until a periodic probe shows the downstream is healthy again. The breaker prevents a degraded downstream from turning every consumer into a slow consumer.
Both are well-supported in mature service meshes and HTTP client libraries (Hystrix and its descendants, the standard Go and Rust patterns, Linkerd / Istio at the mesh layer). The choice is not whether to use them but where. Every dependency that can be slow should be wrapped.
Schema evolution
The producer's payload shape changes. The consumer was built for the old shape. If the consumer crashes on the new shape, you have an outage. If the consumer silently coerces the new shape, you have wrong data forever.
The patterns we use:
- A schema registry. The producer registers the schema of every message type. Consumers fetch and cache. The registry enforces a compatibility rule on every change.
- Compatibility rules. Backward compatibility means new consumers can read old messages. Forward compatibility means old consumers can read new messages. Full compatibility is both. For most pipelines we want forward compatibility — the producer can roll out a schema change and consumers don't break, even if they're slow to upgrade.
- Versioned messages. Every message carries a
schema_versionfield. Consumers can branch on it for backward-incompatible changes that can't be avoided. - Validation at the boundary. The consumer validates the incoming message against the schema before doing anything else. Invalid messages go to the DLQ with a clear error, not into the handler where they cause cryptic downstream failures.
The discipline applies inside the system as well as at the edges. Database schemas evolve too; the same compatibility rules apply, with online migrations and dual-writes as the deploy pattern.
Pipeline observability
Three layers, each catching a different class of failure.
- Per-task metrics. Success rate, failure rate, duration, retry count, last successful run. Standard SRE metrics, scoped per pipeline task. This is what tells you a task is failing.
- Pipeline-level metrics. End-to-end latency (when did the source event happen vs. when did the result land), throughput (messages per minute), lag (how far behind real-time is the pipeline). This is what tells you the pipeline as a whole is healthy or stuck. A pipeline can have every task green and still be hours behind because of a backlog.
- Data-quality metrics. Row counts per partition, null rates per column, distribution of key fields, referential integrity counts. This is what tells you the data is right — that the pipeline is not silently producing garbage. Most pipelines fail this check long before they fail the per-task check.
Without all three, the failure mode is the same: a downstream consumer notices something is wrong before the pipeline team does. With all three, the team has a chance of catching it first.
The orchestrator is the thinnest layer
Pick Airflow, Prefect, Kestra, Dagster, Argo, Temporal — they all give you scheduling, retries, dependency graphs, and a UI. They differ in dynamism, in how they handle data passing, in their concurrency models, in their operational ergonomics. Those differences matter, and they're worth thinking through. (The companion post does that comparison in detail.)
But the orchestrator does not give you idempotency. It does not give you schema discipline. It does not give you data-quality observability. Those have to be in the handler — in the code your tasks actually execute.
A pipeline is a chain of handlers connected by an orchestrator. The handlers are the load-bearing piece. Get the patterns above right at the handler level, and the orchestrator is mostly a scheduling concern. Get them wrong, and no orchestrator will save you.