Concept
When to Use Batch Processing
What Batch Processing Is
Batch processing collects a bounded set of records, limited by size or by a time window, and processes them together as a group. The job has a clear start and end: it runs to completion and produces an output. This is the right model for workloads that can tolerate high latency but need high throughput, correctness, and efficiency.
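To make "bounded in size" concrete, here is a minimal Python sketch that groups an input into fixed-size batches and runs a job over them to completion. The names `batches` and `run_job` are illustrative and not part of any library, and summing stands in for whatever per-batch work a real job would do.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return  # input exhausted: the job has a clear end
        yield chunk


def run_job(records: Iterable[int], size: int) -> int:
    """Process every batch and return one output (here, a grand total)."""
    total = 0
    for chunk in batches(records, size):
        total += sum(chunk)  # placeholder for real per-batch work
    return total
```

A time-bounded batch would follow the same shape, with the grouping key being a window such as "all records from yesterday" instead of a record count.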
When Batch Processing Is the Right Choice
- Large-volume data transformations: Moving and reshaping terabytes of data — loading from an OLTP database into a data warehouse, transforming raw event logs into structured analytics tables, aggregating sensor readings across millions of devices.
- Scheduled periodic jobs: Monthly billing runs, daily report generation, weekly payroll processing, nightly ETL into a data warehouse. The work is defined by a time boundary, not by real-time triggers.
- Computationally expensive tasks where latency is acceptable: Training a machine learning model on the day's data. Transcoding video to multiple resolutions. Generating recommendation models. These take minutes to hours — users don't wait for a response.
- ETL pipelines: Extracting data from operational databases, transforming it for analytics (denormalizing, joining, computing aggregates), and loading it into a warehouse like Redshift or BigQuery.
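The extract, transform, and load steps described in the last item can be sketched in a few lines of Python. This is an illustration under assumptions, not a production pipeline: SQLite (via the standard `sqlite3` module) stands in for both the operational database and the warehouse, and the `orders` and `daily_revenue` tables and their columns are hypothetical.

```python
import sqlite3
from collections import defaultdict


def extract(source: sqlite3.Connection):
    """Read raw rows from the (hypothetical) operational `orders` table."""
    return source.execute(
        "SELECT order_day, customer_id, amount_cents FROM orders"
    ).fetchall()


def transform(rows):
    """Aggregate per-order rows into one revenue total per day."""
    totals = defaultdict(int)
    for order_day, _customer_id, amount_cents in rows:
        totals[order_day] += amount_cents
    return sorted(totals.items())


def load(warehouse: sqlite3.Connection, daily_totals) -> None:
    """Write aggregates into the (hypothetical) `daily_revenue` table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue ("
        "order_day TEXT PRIMARY KEY, revenue_cents INTEGER NOT NULL)"
    )
    # Keyed on order_day, so re-running the job overwrites rather than duplicates.
    warehouse.executemany(
        "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)", daily_totals
    )
    warehouse.commit()


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> None:
    load(warehouse, transform(extract(source)))
```

A real pipeline against Redshift or BigQuery would use that system's own client and bulk-load path; only the three-stage shape carries over.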
When Batch Processing Is Wrong
- Fraud detection: A fraudulent transaction must be caught before the payment clears, so seconds matter. An hourly batch run would catch it too late.
- Live dashboards: A dashboard showing real-time order volume or active users must reflect the present second — not last night's batch.
- Real-time user interactions: Any request where a user is waiting on the response is a poor fit for batch.
Key Design Principles for Batch Jobs
- Idempotency: Batch jobs must be safe to re-run. A job interrupted at 80% must be restartable without duplicating the work already done. Use checkpoints (track progress in a durable store) and idempotency keys on database writes.
- Error handling: A single bad record must never stop an entire batch. Route bad records to a dead-letter queue or an error table and continue processing. Alert on elevated error rates.
- Scalability: Containerize batch jobs with Docker and Kubernetes, or use managed services (AWS Batch, Google Cloud Dataflow). Scale compute resources in proportion to the data volume.
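The idempotency and error-handling principles above can be sketched together in Python. The example assumes SQLite (standard `sqlite3` module) as the durable store, records that arrive sorted by an integer `record_id`, and doubling an integer as the stand-in "work". The table names, the `run_batch` function, and the `fail_after` parameter (which only simulates a crash) are all hypothetical.

```python
import sqlite3


def init_store(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS results (record_id INTEGER PRIMARY KEY, value INTEGER NOT NULL);
        CREATE TABLE IF NOT EXISTS errors (record_id INTEGER PRIMARY KEY, raw TEXT, reason TEXT);
        CREATE TABLE IF NOT EXISTS checkpoint (job TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
        """
    )


def run_batch(conn, job, records, fail_after=None) -> int:
    """Process (record_id, raw) pairs sorted by record_id; return how many were handled."""
    row = conn.execute("SELECT last_id FROM checkpoint WHERE job = ?", (job,)).fetchone()
    last_id = row[0] if row else -1
    processed = 0
    for record_id, raw in records:
        if record_id <= last_id:
            continue  # already handled by an earlier run
        if fail_after is not None and processed >= fail_after:
            raise RuntimeError("simulated crash")
        try:
            value = int(raw) * 2  # placeholder for real work
            # The primary key acts as an idempotency key: a re-run overwrites.
            conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (record_id, value))
        except ValueError as exc:
            # Bad record: set it aside in the error table and keep going.
            conn.execute(
                "INSERT OR REPLACE INTO errors VALUES (?, ?, ?)", (record_id, raw, str(exc))
            )
        # The checkpoint commits in the same transaction as the write it covers.
        conn.execute("INSERT OR REPLACE INTO checkpoint VALUES (?, ?)", (job, record_id))
        conn.commit()
        processed += 1
    return processed
```

Committing after every record keeps the sketch easy to follow. A real job would more likely commit once per chunk and compare the size of the error table against the processed count to drive alerting.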