Every pipeline builder eventually hits a wall: your job fails halfway through, you rerun it, and suddenly the data warehouse has duplicate rows. Or you parallelize processing to meet a deadline, only to find that two tasks overwrite each other's results. This is the sultry calculus of idempotency versus concurrency—two design goals that pull in opposite directions. Idempotency says “running the same work twice is safe.” Concurrency says “run as much as possible at once.” Reconciling them is not a one-time architectural choice; it's an ongoing negotiation between correctness and speed.
This guide is for engineers and architects who design batch or streaming pipelines and have felt the pain of non-deterministic retries. We'll define the conflict, walk through how it manifests under the hood, and provide a decision framework you can apply to your next pipeline.
Why This Conflict Matters Now
Data pipelines are no longer simple cron jobs that move CSV files. Modern pipelines must handle late-arriving data, reprocess historical records, and scale across clusters—all while maintaining a single source of truth. The stakes are high: a non-idempotent pipeline can silently corrupt reports, skew machine learning training data, or double-bill customers.
Concurrency, meanwhile, has become a default expectation. Teams reach for parallel execution to reduce runtime from hours to minutes. But concurrency introduces race conditions, ordering problems, and partial failures. When a concurrent job fails midway and you retry it, the interaction between retries and parallelism can amplify errors. The result is a system that is neither fast nor correct.
Real-World Triggers
Three common scenarios force this trade-off to the surface:
- Retry-heavy environments: Cloud services are unreliable by design. Transient failures are normal. If your pipeline isn't idempotent, every retry risks duplication.
- Multi-tenant pipelines: Different tenants may share compute resources. Concurrency is necessary to isolate workloads, but idempotency must be preserved per tenant.
- Streaming with batch fallback: Systems that consume from Kafka or Kinesis often use micro-batches. If a batch fails and is replayed, the downstream system must handle duplicates without breaking.
The cost of ignoring this calculus is operational debt: manual deduplication scripts, data quality alerts that nobody trusts, and weekend firefights after a rerun. By designing for both idempotency and concurrency from the start, you avoid these crises.
Core Idea in Plain Language
Idempotency means that applying the same operation multiple times has the same effect as applying it once. In pipeline terms: if you run a job today, and then run it again tomorrow with the same input, the output should be identical. Concurrency means that multiple operations execute simultaneously, possibly on different partitions of data.
The conflict arises because concurrency often breaks idempotency. Consider a simple pipeline that reads transactions, aggregates them by account, and writes the result. If two concurrent tasks process overlapping date ranges, they might both try to update the same account balance. Without coordination, one write overwrites the other, and a retry could add the same transaction twice.
The Fundamental Trade-off
You cannot have unlimited concurrency and perfect idempotency without paying a coordination cost. The more you parallelize, the harder it becomes to ensure that operations are deterministic and repeatable. This is not a bug; it's a constraint of distributed systems. The key is to decide where on the spectrum your pipeline should sit based on business requirements.
We can visualize this as a sliding scale:
- Idempotency-first: Serial execution, deterministic ordering, full replay capability. Good for financial reconciliations or regulatory reporting.
- Balanced: Partitioned concurrency with idempotent writes per partition. Good for most ETL workloads.
- Concurrency-first: Optimistic locking, last-write-wins, eventual consistency. Good for high-volume analytics where occasional duplicates are acceptable.
Most teams should target the balanced zone. But to get there, you need to understand the mechanisms that make idempotency and concurrency coexist.
How It Works Under the Hood
Idempotency in pipelines is achieved through three main mechanisms: deterministic replay, idempotent sinks, and unique identifiers. Concurrency, on the other hand, relies on partitioning, ordering, and isolation levels. When these mechanisms interact, subtle bugs emerge.
Deterministic Replay
A pipeline is deterministic if the same input always produces the same intermediate state and output. This requires that all transformations be pure functions—no random numbers, no system clock dependencies, no reading from external state that can change between runs. In practice, many ETL steps violate this: a task might call now() to set a timestamp, or read a configuration file that gets updated. To make a pipeline idempotent, you must pin all non-deterministic inputs at the start of the job and reuse them on retries.
Idempotent Sinks
The write stage is the most common source of non-idempotency. A database upsert (insert or update) is idempotent if the update logic is commutative—that is, applying the same update twice yields the same result as applying it once. For example, SET balance = balance + 10 is not idempotent because running it twice adds 20. But SET balance = 100 is idempotent because the value is absolute. In data pipelines, we often need additive operations, so we must use idempotency keys: a unique row identifier that ensures the operation only happens once.
Unique Identifiers and Deduplication
Each record processed by a pipeline should carry a unique ID. When writing to a sink, you can use this ID to deduplicate: if a row with the same ID already exists, skip the insert or apply a no-op update. This pattern is common in systems like Apache Kafka Connect and AWS Kinesis Firehose. However, deduplication only works if the ID is deterministic across retries. If the ID includes a timestamp or random number, retries will generate new IDs and duplicates will appear.
Concurrency with Partitioning
Concurrency is easier to manage when the data is partitioned by a key that aligns with the idempotency boundary. For example, if you partition by customer ID, then all events for a single customer are processed by the same worker. This worker can maintain ordering and ensure idempotent writes without coordinating with other workers. The challenge is choosing a partition key that balances load and avoids hot spots.
Under the hood, distributed execution frameworks like Apache Spark or Beam handle concurrency via task slots. Each slot processes a partition independently. If a task fails, the framework retries it on another slot—but the new slot may not have the same view of state. This is where idempotency breaks: a retried task might write to a sink that was already partially updated by the original task's partial output.
Worked Example: Late-Arriving Transactions
Let's walk through a realistic scenario. A financial services company processes credit card transactions. They run a daily batch pipeline that aggregates transactions by account and updates the current balance. The pipeline is deployed on a Kubernetes cluster and uses Spark for processing.
The Setup
The input is a CSV file of transactions, each with a transaction_id, account_id, amount, and timestamp. The output is a PostgreSQL table with columns account_id, balance, and last_updated. The pipeline reads the file, groups by account_id, sums the amounts, and writes the result with an upsert: INSERT ... ON CONFLICT (account_id) DO UPDATE SET balance = excluded.balance.
The Problem
One day, a file arrives with transactions from yesterday—late-arriving data. The pipeline has already processed yesterday's file and written balances. Now the new file contains some transactions that were already in yesterday's file (duplicates) plus some new ones. The team reruns the pipeline with both files. Because the upsert is absolute (setting balance to the sum of the current file), the old balance is overwritten. But the new file doesn't include all of yesterday's transactions, so the balance becomes incorrect.
Additionally, the pipeline runs with four concurrent tasks, each processing a subset of accounts. Two tasks happen to receive accounts that overlap because of a bug in the partition function. They both try to upsert the same account, and the order of writes is non-deterministic. The final balance depends on which task ran last.
The Fix
To achieve both idempotency and concurrency, the team makes three changes:
- Use a deterministic partition key: They partition by
account_idusing a hash function that guarantees each account goes to exactly one task. - Add an idempotency key: Each write includes a
batch_idcolumn. The upsert becomesINSERT ... ON CONFLICT (account_id) DO UPDATE SET balance = excluded.balance, batch_id = excluded.batch_id WHERE excluded.batch_id > batch_id. This ensures that only the most recent batch wins, and rerunning the same batch is a no-op. - Use deterministic replay: The pipeline records the input file checksum and the timestamp of the run. On retry, it reuses the same
batch_id, so the upsert condition prevents overwriting with stale data.
After these changes, the team can run the pipeline with any concurrency level, and retries are safe. Late-arriving data is handled by assigning it a new batch ID and running a separate job that only updates accounts affected by the new transactions.
Edge Cases and Exceptions
Even with good design, edge cases can break the idempotency-concurrency balance. Here are the most common ones and how to handle them.
Partial Failures in Distributed Sinks
When a pipeline writes to multiple sinks (e.g., a database and a message queue), a partial failure can leave the system inconsistent. If the database write succeeds but the queue write fails, retrying the task will re-insert the database row, causing a duplicate. The solution is to use a transactional outbox pattern: write all outputs in a single atomic operation, or implement a two-phase commit if the sinks support it.
Exactly-Once Semantics in Streaming
Streaming pipelines that claim “exactly-once” processing often rely on idempotent sinks and checkpointing. But checkpointing itself can be non-idempotent if the state store is not transactional. For example, if a checkpoint is written but the pipeline crashes before the output is committed, the next run may replay data that was already written. This is a known challenge in systems like Kafka Streams and Flink. The workaround is to use idempotent producers and transactional writes, but this adds latency.
Non-Commutative Updates
Some updates are inherently non-commutative. For example, appending to a list: SET items = items || new_item. Running this twice appends the item twice. To make it idempotent, you must either use a set data type or include a uniqueness constraint on the items. In general, prefer absolute assignments over additive ones, and when additive operations are unavoidable, ensure they are idempotent by using idempotency keys.
Cross-Partition Dependencies
If your pipeline has a step that requires data from multiple partitions (e.g., a global sort or join), concurrency becomes much harder. The framework must shuffle data, which introduces coordination points. At these points, idempotency can break if the shuffle is not deterministic. Spark's shuffle is deterministic when the same partitioning and ordering are used, but if a task fails and is retried on a different executor, the shuffle files may be regenerated with different ordering. The fix is to use a deterministic hash partitioner and avoid relying on row order within partitions.
Limits of the Approach
No design pattern is a silver bullet. The balanced approach—partitioned concurrency with idempotent sinks—works for many pipelines but has limits.
Coordination Overhead
Idempotency mechanisms like upsert conditions and idempotency keys add overhead. Each write must include additional columns and checks. In high-throughput pipelines, this can reduce write throughput by 10–30%. If your pipeline processes billions of events per day, the trade-off may lean toward concurrency-first and accept occasional duplicates that are cleaned up in a separate deduplication step.
State Size
Idempotency keys require storage. If you track every record's ID, the state can grow large. For append-only sinks, you can use a bloom filter to probabilistically deduplicate, but false positives can cause data loss. For update-heavy workloads, you need a full index of keys, which increases storage costs.
Non-Deterministic Transformations
Some pipelines must use non-deterministic operations, such as machine learning model inference that depends on random sampling. In these cases, idempotency is impossible unless you seed the random number generator deterministically. Even then, the model itself may be updated between retries, leading to different outputs. The recommendation is to separate non-deterministic steps from the main pipeline and run them in a controlled environment with versioned models.
When to Choose Concurrency-First
If your pipeline is used for exploratory analytics where occasional duplicates are acceptable, or if the cost of coordination outweighs the cost of manual deduplication, consider concurrency-first. Examples include log aggregation for dashboards that tolerate 1% error, or A/B testing pipelines where data is already sampled. In these cases, use a simple last-write-wins strategy and schedule periodic cleanup jobs.
Ultimately, the sultry calculus of idempotency versus concurrency is about knowing your requirements and testing your assumptions. Run chaos experiments: kill tasks mid-way, rerun jobs with overlapping inputs, and verify that the output is correct. Document your idempotency guarantees and communicate them to downstream consumers. With careful design, you can have both speed and safety—but you must weigh the costs honestly.
Comments (0)
Please sign in to post a comment.
Don't have an account? Create one
No comments yet. Be the first to comment!