Batch Processing
Batch processing is the foundational data processing paradigm in which large volumes of data are collected, stored, and processed together in discrete groups, or batches. It remains the dominant approach for most data analytics, reporting, and data warehousing workloads because it offers simplicity, scalability, and cost-effectiveness. This topic explores how batch processing works, when to use it, and real-world applications.
What is Batch Processing?
Definition and Core Concept
Batch processing is a computing method where:
• Data is collected and stored over a period of time
• The accumulated data is then processed together as a single group (a batch)
• Jobs run on a schedule or trigger rather than continuously
• Results become available after the whole batch finishes
Key Characteristics
| Characteristic | Description |
|---|---|
| Timing | Scheduled or event-triggered (not real-time) |
| Data Volume | Large volumes processed together |
| Latency | Hours to days acceptable |
| Complexity | Can handle complex transformations |
| Cost | Very efficient for large volumes |
| Infrastructure | Simpler, proven technology |
How Batch Processing Works
Basic Workflow
Data Collection
↓
Store in Storage System
↓
Scheduled Batch Job Trigger
↓
Load Data into Memory
↓
Process/Transform Data
↓
Load Results to Destination
↓
Notify Stakeholders
↓
Clean up/Archive
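To make these steps concrete, the sketch below walks the same workflow in plain Python. The file paths, the CSV layout (product and amount columns), and the run_daily_batch helper are hypothetical; a production job would typically run the same logic in Spark against cloud storage.

# Minimal sketch of the workflow above; paths and CSV schema are assumptions.
import csv
import shutil
from datetime import date

def run_daily_batch(input_path="data/incoming/sales.csv",
                    output_path="data/processed/daily_totals.csv",
                    archive_dir="data/archive/"):
    # Load data into memory
    with open(input_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Process/transform: total revenue per product
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["amount"])

    # Load results to destination
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "total_amount"])
        for product, amount in sorted(totals.items()):
            writer.writerow([product, round(amount, 2)])

    # Notify stakeholders (placeholder) and archive the processed input
    print(f"Batch for {date.today()} complete: {len(totals)} products summarized")
    shutil.move(input_path, f"{archive_dir}sales_{date.today()}.csv")

if __name__ == "__main__":
    run_daily_batch()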
Practical Example: Daily Sales Report
Day 1:
– Morning: Sales occur throughout the day
– Evening: All transactions stored in the database
Day 2:
– 2:00 AM: Batch job starts
– 2:15 AM: Load all of yesterday's transactions
– 2:30 AM: Calculate totals, aggregations, insights
– 2:45 AM: Generate reports, send emails
– 3:00 AM: Archive data, clean up
Result: Sales report ready before business hours start at 9:00 AM
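A minimal sketch of what the 2:00 AM job itself might look like, assuming a hypothetical SQLite database with a transactions(sale_date, amount) table and a local SMTP relay for delivering the report:

# Minimal sketch of the overnight report job; the database and SMTP host are assumptions.
import sqlite3
import smtplib
from email.message import EmailMessage
from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()

# Load all of yesterday's transactions and calculate totals
conn = sqlite3.connect("sales.db")
total, count = conn.execute(
    "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions WHERE sale_date = ?",
    (yesterday,),
).fetchone()
conn.close()

# Generate the report and email it to stakeholders
msg = EmailMessage()
msg["Subject"] = f"Daily sales report for {yesterday}"
msg["From"] = "batch@example.com"
msg["To"] = "sales-team@example.com"
msg.set_content(f"{count} transactions, total revenue ${total:,.2f}")

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)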
Batch Processing Tools and Technologies
Traditional Batch Processing
MapReduce → Spark → Flink Batch
↓
Large-scale distributed batch processing
Tools:
– Apache Spark (most popular)
– Hadoop MapReduce
– Apache Hive
– Presto / Trino
– Dask (Python)
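For Python-native workloads, Dask covers the same batch pattern without a Spark cluster. A minimal sketch, assuming a hypothetical directory of Parquet order files with product and amount columns:

# Minimal Dask sketch; the input path and column names are assumptions.
import dask.dataframe as dd

# Lazily read a directory of Parquet files as one logical dataframe
orders = dd.read_parquet("data/orders/2024-01-01/")

# Define the aggregation lazily, then trigger the batch computation
revenue_by_product = orders.groupby("product")["amount"].sum()
print(revenue_by_product.compute())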
Common Batch Processing Patterns
Pattern 1: ETL (Extract, Transform, Load)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Extract: Read source data
df_raw = spark.read.parquet("s3://bucket/raw-data/2024-01-01/")

# Transform: Clean and process
df_clean = (
    df_raw
    .filter(df_raw.status == "completed")
    .withColumn("amount_usd", df_raw.amount * 1.1)
)

# Load: Write to destination
df_clean.write.parquet("s3://bucket/processed-data/2024-01-01/")

spark.stop()
Pattern 2: Aggregation and Reporting
from pyspark.sql.functions import avg, sum

# Calculate daily sales summary
daily_summary = (
    df.groupBy("date", "product")
    .agg(
        sum("quantity").alias("total_quantity"),
        sum("revenue").alias("total_revenue"),
        avg("price").alias("avg_price"),
    )
)

daily_summary.write.mode("overwrite").parquet("output/daily_summary")
Pattern 3: Data Deduplication
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Remove duplicates, keeping the latest record per customer
window = Window.partitionBy("customer_id").orderBy(col("timestamp").desc())

df_deduplicated = (
    df.withColumn("row_num", row_number().over(window))
    .filter(col("row_num") == 1)
    .drop("row_num")
)
When to Use Batch Processing
Ideal Use Cases
• Scheduled reporting and analytics (daily, weekly, monthly summaries)
• Data warehousing and ETL pipelines
• Data migrations and large historical backfills
• Complex transformations over large volumes where hours of latency are acceptable
When NOT to Use
• Results are needed in seconds or milliseconds (alerts, monitoring)
• Data arrives as a continuous stream that must be acted on immediately
• Users expect up-to-the-moment freshness in dashboards or applications
Batch Processing Architecture
Traditional Architecture
Data Sources
(Database, Files, APIs)
↓
Batch Storage
(HDFS, S3, Data Lake)
↓
Batch Processing Engine
(Spark, Hadoop, Flink)
↓
Processing Results
↓
Data Warehouse/Data Mart
↓
BI Tools/Reports/Dashboards
↓
End Users
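The "Processing Engine → Data Warehouse" hop in this diagram is often just a JDBC write from the batch engine. A minimal PySpark sketch, assuming a hypothetical PostgreSQL warehouse, table name, and credentials (the PostgreSQL JDBC driver must be on the Spark classpath):

# Minimal sketch of loading batch results into a warehouse over JDBC;
# the connection details and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadToWarehouse").getOrCreate()

results = spark.read.parquet("s3://bucket/processed-data/daily_summary/")

(results.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
    .option("dbtable", "reporting.daily_summary")
    .option("user", "etl_user")
    .option("password", "********")  # use a secrets manager in practice
    .mode("append")
    .save())

spark.stop()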
Scheduling Batch Jobs
Common Scheduling Methods
1. Time-Based (Cron)
# Daily at 2:00 AM
0 2 * * * /opt/spark/bin/spark-submit batch_job.py
# Every 6 hours
0 */6 * * * /opt/spark/bin/spark-submit batch_job.py
# Weekly on Monday at 1:00 AM
0 1 * * 1 /opt/spark/bin/spark-submit batch_job.py
2. Workflow Orchestration (Airflow)
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime

default_args = {
    'owner': 'data_team',
    'start_date': datetime(2024, 1, 1),
}

dag = DAG('daily_batch_job', default_args=default_args, schedule_interval='0 2 * * *')

spark_job = SparkSubmitOperator(
    task_id='run_batch_process',
    application='/opt/spark/jobs/process_data.py',
    conf={'spark.executor.memory': '8g'},
    dag=dag,
)
3. Event-Driven
# Process when a file arrives in S3
import os
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Triggered when a new file lands in S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    # Launch Spark job
    os.system(f'spark-submit process.py {bucket}/{key}')
Batch Processing Performance Considerations
Optimization Strategies
# Partition output by frequently filtered columns
df.write.partitionBy("date", "region").parquet("output/")

# Bucket by a join key to reduce shuffles (bucketing requires saveAsTable)
df.write.bucketBy(10, "customer_id").mode("overwrite").saveAsTable("output_bucketed")

# Compress output files (snappy for Parquet)
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Right-size executor resources when submitting the job
spark-submit --executor-memory 8g --executor-cores 4 job.py
Real-World Example: Daily E-commerce Batch Job
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count
from datetime import datetime, timedelta

spark = (
    SparkSession.builder
    .appName("DailyEcommerceBatch")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# Configuration
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
input_path = f"s3://ecommerce-raw/transactions/{yesterday}/"
output_path = f"s3://ecommerce-processed/daily_summary/{yesterday}/"

try:
    # Extract: Read raw transactions
    df_transactions = spark.read.parquet(input_path)
    print(f"Loaded {df_transactions.count()} transactions")

    # Transform: Filter and enrich
    df_valid = df_transactions.filter(
        (col("status") == "completed") &
        (col("amount") > 0)
    )

    # Aggregate: Calculate daily metrics
    daily_metrics = (
        df_valid.groupBy("product_category", "region")
        .agg(
            count("transaction_id").alias("transaction_count"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value"),
        )
    )

    # Calculate percentage of total revenue
    total_revenue = daily_metrics.agg({"total_revenue": "sum"}).collect()[0][0]
    daily_metrics = daily_metrics.withColumn(
        "pct_of_total",
        (col("total_revenue") / total_revenue) * 100
    )

    # Load: Write to processed zone
    daily_metrics.write.mode("overwrite").parquet(output_path)

    # Cleanup: Archive raw data
    print(f"✅ Batch job completed. Results in {output_path}")

except Exception as e:
    print(f"❌ Batch job failed: {e}")
    # Send alert to ops team
    raise
finally:
    spark.stop()
Batch Processing vs Real-Time
| Aspect | Batch | Real-Time |
|---|---|---|
| Latency | Hours/Days | Milliseconds |
| Data Volume | Large volumes | Continuous streams |
| Complexity | Complex transformations | Simple operations |
| Cost | Lower | Higher |
| Technology | Spark, Hadoop | Kafka, Flink |
| Use Cases | Reporting, Analytics | Alerts, Monitoring |
Best Practices
• Idempotency: Jobs should produce the same results when rerun (see the sketch after this list)
• Monitoring: Track job success/failure, duration, resources
• Data Quality: Validate data before and after processing
• Error Handling: Graceful failures with rollback capability
• Logging: Comprehensive logging for debugging
• Scheduling: Account for job duration in schedule gaps
• Retention: Define data retention policies
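For the idempotency point above, the sketch below shows one common rerun-safe pattern: overwriting only the partition for the run date, so a rerun replaces that day's output instead of duplicating it. It assumes the input already carries a date column and that dynamic partition overwrite is available; paths and names are hypothetical.

# Minimal sketch of an idempotent daily job; paths and the date column are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IdempotentDailyJob")
    # Overwrite only the partitions written by this run, not the whole dataset
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

run_date = "2024-01-01"  # normally injected by the scheduler

df = spark.read.parquet(f"s3://bucket/raw-data/{run_date}/")

(df.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://bucket/processed-data/"))

spark.stop()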
💡 Key Insights
• Batch processing dominates the enterprise: the large majority of data processing workloads are still batch
• Cost-effective at scale: Process millions of rows for cents
• Mature technology: Well-established best practices
• Excellent for reporting: Perfect for scheduled analytics
• Trade-off: Latency vs cost and simplicity
📚 Study Notes
• Batch processing: Data collected and processed together
• Scheduled: Runs at predetermined times
• Large-scale: Optimized for big volumes
• Cost-effective: Efficient resource utilization
• Common tools: Spark, Hadoop, Hive
• Use cases: Reporting, warehousing, migrations
• Scheduling: Cron, Airflow, event-driven