Airflow vs. Prefect vs. Kestra at production scale.
Three orchestrators we've shipped with, the operational shapes each fits, and the production gotchas that don't show up in benchmarks. Companion to the data-orchestration-patterns post.
The companion post argued that the orchestrator is the thinnest layer of a data pipeline — the patterns under it (idempotency, dead-letter routing, back-pressure, schema discipline, observability) matter more than the tool that schedules them. This note is the orchestrator comparison itself. The three we ship with most often: Airflow, Prefect, Kestra.
The aim is not a feature matrix. It's an honest read of what each one is good at, where each one strains, and how to pick between them when the patterns are already in place.
At a glance
| Tool | Strengths | When we reach for it | Gotcha at scale |
|---|---|---|---|
| Airflow | Mature, large ecosystem, managed offerings on every cloud, deep operator catalog. | Existing infra already has it; batch-heavy ETL; the team has Airflow muscle memory. | Scheduler latency, executor choice, metadata-DB load — all manageable, none free. |
| Prefect | Pythonic flows-as-functions, dynamic mapping, cleaner local dev, good observability. | Greenfield Python pipelines, mixed batch/streaming, teams that value a Pythonic surface. | Self-hosting takes more work than Cloud; Cloud cost grows fast at high run volume. |
| Kestra | Declarative YAML, event-driven first-class, strong integration plugins, UI-forward. | When triggers and integrations matter; non-Python-fluent teams; YAML-as-source-of-truth. | Smaller community, plugin maturity uneven for niche systems, breaking changes still real. |
Airflow — the default for a reason
The default isn't a derogation. Airflow has the largest ecosystem of any orchestrator, every major cloud offers a managed version (MWAA, Cloud Composer, Astronomer's offering wrapping vanilla Airflow), the operator catalog covers nearly every system you might want to talk to, and there's a decade of accumulated production experience in the public corpus. The job postings agree.
Where it earns its keep: existing infrastructure where Airflow is already running, batch-heavy ETL with stable cadences (hourly, daily, weekly), and teams whose mental model already speaks DAG / operator / sensor.
Where it strains: the scheduler is the historical bottleneck. Airflow 2.x improved this dramatically and 3.x continues to; even so, scheduler latency, executor choice (Celery vs. Kubernetes vs. local), and metadata-database load are real production concerns at high task volume. Dynamic DAG construction works but is awkward — the pattern was bolted onto a static-DAG world. The operator/sensor model encourages a coding style where business logic leaks into Airflow primitives, which makes the pipelines hard to test outside Airflow.
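One way to limit that leakage is to keep the business logic in plain functions and make the Airflow task a thin wrapper around them. The following is a minimal sketch, not a prescribed layout: `normalize_orders`, `normalize_orders_task`, and the record shape are hypothetical names invented for illustration. The wrapper is the kind of callable you would hand to an operator such as `PythonOperator` via `python_callable`.

```python
from __future__ import annotations

from typing import Iterable


def normalize_orders(rows: Iterable[dict]) -> list[dict]:
    """Pure business logic: no Airflow imports, so any test runner can call it.

    Hypothetical transform: drops rows without an order id and converts
    amounts to integer cents.
    """
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip malformed rows instead of failing the whole batch
        cleaned.append(
            {
                "order_id": str(row["order_id"]),
                "amount_cents": round(float(row["amount"]) * 100),
            }
        )
    return cleaned


def normalize_orders_task(rows: list[dict]) -> list[dict]:
    """Thin orchestration wrapper: the only function the DAG needs to reference."""
    return normalize_orders(rows)
```

With this split, the transform can be unit-tested with an ordinary list of dicts, and only the wrapper needs an Airflow environment to exercise.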
The honest read: if you're greenfield and Python-native, you probably don't need Airflow. If you're integrating into a stack that already has it, fighting it is expensive and rarely justified.
Prefect — Pythonic by design
Prefect's value proposition is a much cleaner Python surface. Flows are functions, tasks are functions, and data moves between them as ordinary Python return values rather than through XComs. Local development is straightforward: in current releases (2.x and later) you call the flow function like any other Python function, and it runs the same way it will in production. Dynamic flows, mapping, and conditional branches are first-class rather than retrofitted.
Where it earns its keep: greenfield pipelines in Python, mixed batch/streaming workloads (the worker/work-pool model, which replaced agents in older 2.x releases, handles both), teams that read `def` more naturally than YAML, and projects that benefit from running the same flow code locally and in cloud.
Where it strains: self-hosting Prefect Server is real work — the happy path is Prefect Cloud, and Cloud cost scales with run volume in a way that surprises teams that didn't price it in. The 2.x to 3.x transition (and the earlier 1.x to 2.x rewrite before it) means the public corpus is fragmented across versions, and answers from a year ago might be wrong now. The ecosystem is smaller than Airflow's; some integrations you'd reach for an Airflow operator for don't exist as Prefect blocks and you write them yourself.
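Pricing that in starts with a run-volume estimate, whatever the billing unit on your plan turns out to be. The sketch below only counts runs; it contains no prices on purpose, and the flow names and figures are hypothetical placeholders to replace with your own.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FlowProfile:
    """Hypothetical description of one scheduled flow."""

    name: str
    runs_per_day: int   # how often the flow is triggered
    tasks_per_run: int  # task runs per flow run, including any mapped fan-out


def monthly_run_volume(flows: list[FlowProfile], days: int = 30) -> dict[str, int]:
    """Count flow runs and task runs over `days` days."""
    flow_runs = sum(f.runs_per_day * days for f in flows)
    task_runs = sum(f.runs_per_day * f.tasks_per_run * days for f in flows)
    return {"flow_runs": flow_runs, "task_runs": task_runs}


# Illustrative numbers only: one small hourly flow and one wide fan-out flow.
profiles = [
    FlowProfile("hourly_ingest", runs_per_day=24, tasks_per_run=5),
    FlowProfile("per_customer_fanout", runs_per_day=24, tasks_per_run=2_000),
]
volume = monthly_run_volume(profiles)
```

In this made-up example both flows trigger equally often, yet nearly all of the task runs come from the fan-out flow, which is the shape worth checking against a pricing page before committing.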
The honest read: if your team is Python-fluent and the pipeline is greenfield, Prefect is usually the more pleasant tool day-to-day. The operational tradeoffs (self-hosting effort, Cloud cost) are the questions to ask before committing.
Kestra — declarative and event-driven
Kestra is the newest of the three in the production space. The defining choice is YAML-first: pipelines are declarative documents, not Python code. Event-driven triggers are first-class — Kafka, webhooks, file drops, schedule, and a flow trigger model that lets one pipeline cleanly invoke another. The UI is the most complete of the three out of the box, particularly for non-engineer stakeholders who want to see what's running.
Where it earns its keep: pipelines where triggers and integrations dominate (event-driven workloads, multi-system orchestration), teams that aren't all Python-fluent (analytics engineers, data engineers from a JVM background, ops staff who want to read the pipeline without reading code), and shops where YAML-as-source-of-truth is a desired property — easy to version, easy to review, easy to generate.
Where it strains: the community is smaller, so plugin maturity for niche systems is uneven (the popular ones are excellent; the long tail is not). Breaking changes still happen on minor versions in ways that mature Airflow installs would not tolerate. The Pebble templating language is powerful but adds a second mental model on top of YAML, and complex flows can leave you feeling that a real programming language would have been simpler.
The honest read: when the workload is event-shaped and integration-heavy, Kestra's design has fewer rough edges than the alternatives. When the workload is pure Python data processing, the YAML surface adds friction.
At production scale: where the gotchas live
For all three, the production failure modes are not "the tool doesn't work." They are operational details that benchmarks don't surface.
- Airflow. Scheduler latency is real: under load, tasks that should start at `T` start at `T + 30s` or worse. Mitigations: scheduler HA (multiple scheduler processes), tuning `min_file_process_interval`, switching to the Kubernetes executor for spike workloads. The metadata DB is the second bottleneck; partitioning historical task instance rows or aggressively trimming history matters at high volume. Logs: pick a remote-log handler (S3, GCS, ELK) early, because the local-disk default does not survive any real install.
- Prefect. Concurrency limits are on flows and on tasks; misconfigured, you get either thundering herds or starvation. Work pools want to be carefully sized to the worker fleet. Prefect Cloud bills on flow runs and task runs both, so high-cardinality task fan-out can balloon the bill. Self-hosted: the Postgres backing store wants to be tuned the same way Airflow's does.
- Kestra. The internal storage backend (where the files and intermediate data that executions produce live) is the operational lever: local filesystem is a starter only, and S3 / GCS / Azure Blob is the production answer. Plugin versions need to be pinned; auto-updates have bitten installs. Schema changes between minor versions are sometimes manual.
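The scheduler-latency point in the Airflow bullet is easy to quantify before tuning anything. A rough sketch, assuming you can export pairs of intended and actual start times from your own task metadata or logs; the function names and sample timestamps are hypothetical, and nothing here queries Airflow itself.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def scheduling_lag_seconds(pairs):
    """Lag in seconds between intended and actual start, one value per task."""
    return [(started - scheduled).total_seconds() for scheduled, started in pairs]


def p95(values):
    """95th percentile, interpolated within the observed range; needs two or more values."""
    return quantiles(values, n=20, method="inclusive")[-1]


# Hypothetical sample: five tasks due at the same instant, one starting 30s late.
due = datetime(2024, 1, 1, 0, 0, 0)
sample = [(due, due + timedelta(seconds=s)) for s in (2, 3, 5, 30, 4)]

lags = scheduling_lag_seconds(sample)
p95_lag = p95(lags)
```

Tracking a high percentile rather than the mean keeps the occasional badly delayed task visible, which is usually the case that matters under load.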
The pick is downstream of the patterns
If the patterns under the orchestrator are right — idempotent handlers, at-least-once + dedup, dead-letter routing, schema discipline, observability — all three of these work. The choice is operational, not architectural. Pick the one whose strengths match your operation, whose gotchas you can absorb, and whose ecosystem your team can navigate.
If the patterns are wrong, no orchestrator saves you. Airflow's retries become a duplication engine. Prefect's mapping turns into a fan-out catastrophe. Kestra's triggers fire on events that should never have been emitted. Fix the patterns; then the orchestrator is a tool decision, not a bet.
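The retries-as-duplication failure is the easiest of these to show in miniature. The sketch below assumes a sink that can upsert on a key; `DedupSink` and `handle_batch` are hypothetical in-memory stand-ins for illustration, not any orchestrator's API.

```python
class DedupSink:
    """In-memory stand-in for a sink with a unique-key constraint."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, row):
        # Keyed write: replaying the same key overwrites instead of appending.
        self._rows[key] = row

    def __len__(self):
        return len(self._rows)


def handle_batch(sink, run_id, records):
    """Idempotent handler: keys derive from the logical run and record id,
    never from the attempt number or wall-clock time."""
    for record in records:
        key = f"{run_id}:{record['id']}"
        sink.upsert(key, record)


sink = DedupSink()
batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

handle_batch(sink, "2024-01-01", batch)
handle_batch(sink, "2024-01-01", batch)  # simulated orchestrator retry of the same run
```

After the simulated retry the sink still holds one row per record. With an append-only write in place of the keyed upsert, the same retry would have doubled the data under any of the three orchestrators.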