Batch Processing
Batch processing is the foundational data processing paradigm in which large volumes of data are collected, stored, and processed together in discrete groups, or batches. It remains the dominant approach for most data analytics, reporting, and data warehousing workloads because it offers simplicity, scalability, and cost-effectiveness. This topic explores how batch processing works, when to use it, and real-world applications.
What is Batch Processing?
Definition and Core Concept
Batch processing is a computing method where:
• Data is collected and stored over a period of time
• The accumulated data is then processed together as a single group (a batch)
• Jobs run on a schedule or trigger rather than continuously
• Results become available after the whole batch finishes
Key Characteristics
| Characteristic | Description |
|---|---|
| Timing | Scheduled or event-triggered (not real-time) |
| Data Volume | Large volumes processed together |
| Latency | Hours to days acceptable |
| Complexity | Can handle complex transformations |
| Cost | Very efficient for large volumes |
| Infrastructure | Simpler, proven technology |
How Batch Processing Works
Basic Workflow
Data Collection
↓
Store in Storage System
↓
Scheduled Batch Job Trigger
↓
Load Data into Memory
↓
Process/Transform Data
↓
Load Results to Destination
↓
Notify Stakeholders
↓
Clean up/Archive
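To make these steps concrete, the sketch below walks the same workflow in plain Python. The file paths, the CSV layout (product and amount columns), and the run_daily_batch helper are hypothetical; a production job would typically run the same logic in Spark against cloud storage.

# Minimal sketch of the workflow above; paths and CSV schema are assumptions.
import csv
import shutil
from datetime import date

def run_daily_batch(input_path="data/incoming/sales.csv",
                    output_path="data/processed/daily_totals.csv",
                    archive_dir="data/archive/"):
    # Load data into memory
    with open(input_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Process/transform: total revenue per product
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["amount"])

    # Load results to destination
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "total_amount"])
        for product, amount in sorted(totals.items()):
            writer.writerow([product, round(amount, 2)])

    # Notify stakeholders (placeholder) and archive the processed input
    print(f"Batch for {date.today()} complete: {len(totals)} products summarized")
    shutil.move(input_path, f"{archive_dir}sales_{date.today()}.csv")

if __name__ == "__main__":
    run_daily_batch()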
Practical Example: Daily Sales Report
Day 1:
– Morning: Sales occur throughout the day
– Evening: All transactions stored in the database
Day 2:
– 2:00 AM: Batch job starts
– 2:15 AM: Load all of yesterday's transactions
– 2:30 AM: Calculate totals, aggregations, insights
– 2:45 AM: Generate reports, send emails
– 3:00 AM: Archive data, clean up
Result: Sales report ready before business hours start at 9:00 AM
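A minimal sketch of what the 2:00 AM job itself might look like, assuming a hypothetical SQLite database with a transactions(sale_date, amount) table and a local SMTP relay for delivering the report:

# Minimal sketch of the overnight report job; the database and SMTP host are assumptions.
import sqlite3
import smtplib
from email.message import EmailMessage
from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()

# Load all of yesterday's transactions and calculate totals
conn = sqlite3.connect("sales.db")
total, count = conn.execute(
    "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM transactions WHERE sale_date = ?",
    (yesterday,),
).fetchone()
conn.close()

# Generate the report and email it to stakeholders
msg = EmailMessage()
msg["Subject"] = f"Daily sales report for {yesterday}"
msg["From"] = "batch@example.com"
msg["To"] = "sales-team@example.com"
msg.set_content(f"{count} transactions, total revenue ${total:,.2f}")

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)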
Batch Processing Tools and Technologies
Traditional Batch Processing
MapReduce → Spark → Flink Batch
↓
Large-scale distributed batch processing
Tools:
– Apache Spark (most popular)
– Hadoop MapReduce
– Apache Hive
– Presto / Trino
– Dask (Python)
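For Python-native workloads, Dask covers the same batch pattern without a Spark cluster. A minimal sketch, assuming a hypothetical directory of Parquet order files with product and amount columns:

# Minimal Dask sketch; the input path and column names are assumptions.
import dask.dataframe as dd

# Lazily read a directory of Parquet files as one logical dataframe
orders = dd.read_parquet("data/orders/2024-01-01/")

# Define the aggregation lazily, then trigger the batch computation
revenue_by_product = orders.groupby("product")["amount"].sum()
print(revenue_by_product.compute())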
Common Batch Processing Patterns
Pattern 1: ETL (Extract, Transform, Load)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Extract: Read source data
df_raw = spark.read.parquet("s3://bucket/raw-data/2024-01-01/")

# Transform: Clean and process
df_clean = (
    df_raw
    .filter(df_raw.status == "completed")
    .withColumn("amount_usd", df_raw.amount * 1.1)
)

# Load: Write to destination
df_clean.write.parquet("s3://bucket/processed-data/2024-01-01/")

spark.stop()
Pattern 2: Aggregation and Reporting
from pyspark.sql.functions import avg, sum

# Calculate daily sales summary
daily_summary = (
    df.groupBy("date", "product")
    .agg(
        sum("quantity").alias("total_quantity"),
        sum("revenue").alias("total_revenue"),
        avg("price").alias("avg_price"),
    )
)

daily_summary.write.mode("overwrite").parquet("output/daily_summary")
Pattern 3: Data Deduplication
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Remove duplicates, keeping the latest record per customer
window = Window.partitionBy("customer_id").orderBy(col("timestamp").desc())

df_deduplicated = (
    df.withColumn("row_num", row_number().over(window))
    .filter(col("row_num") == 1)
    .drop("row_num")
)
When to Use Batch Processing
Ideal Use Cases
• Scheduled reporting and analytics (daily, weekly, monthly summaries)
• Data warehousing and ETL pipelines
• Data migrations and large historical backfills
• Complex transformations over large volumes where hours of latency are acceptable
When NOT to Use
• Results are needed in seconds or milliseconds (alerts, monitoring)
• Data arrives as a continuous stream that must be acted on immediately
• Users expect up-to-the-moment freshness in dashboards or applications
Batch Processing Architecture
Traditional Architecture
Data Sources
(Database, Files, APIs)
↓
Batch Storage
(HDFS, S3, Data Lake)
↓
Batch Processing Engine
(Spark, Hadoop, Flink)
↓
Processing Results
↓
Data Warehouse/Data Mart
↓
BI Tools/Reports/Dashboards
↓
End Users
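The "Processing Engine → Data Warehouse" hop in this diagram is often just a JDBC write from the batch engine. A minimal PySpark sketch, assuming a hypothetical PostgreSQL warehouse, table name, and credentials (the PostgreSQL JDBC driver must be on the Spark classpath):

# Minimal sketch of loading batch results into a warehouse over JDBC;
# the connection details and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadToWarehouse").getOrCreate()

results = spark.read.parquet("s3://bucket/processed-data/daily_summary/")

(results.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
    .option("dbtable", "reporting.daily_summary")
    .option("user", "etl_user")
    .option("password", "********")  # use a secrets manager in practice
    .mode("append")
    .save())

spark.stop()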
Scheduling Batch Jobs
Common Scheduling Methods
1. Time-Based (Cron)
# Daily at 2:00 AM
0 2 * * * /opt/spark/bin/spark-submit batch_job.py
# Every 6 hours
0 */6 * * * /opt/spark/bin/spark-submit batch_job.py
# Weekly on Monday at 1:00 AM
0 1 * * 1 /opt/spark/bin/spark-submit batch_job.py
2. Workflow Orchestration (Airflow)
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime

default_args = {
    'owner': 'data_team',
    'start_date': datetime(2024, 1, 1),
}

dag = DAG('daily_batch_job', default_args=default_args, schedule_interval='0 2 * * *')

spark_job = SparkSubmitOperator(
    task_id='run_batch_process',
    application='/opt/spark/jobs/process_data.py',
    conf={'spark.executor.memory': '8g'},
    dag=dag,
)
3. Event-Driven
# Process when a file arrives in S3
import os
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Triggered when a new file lands in S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    # Launch Spark job
    os.system(f'spark-submit process.py {bucket}/{key}')
Batch Processing Performance Considerations
Optimization Strategies
# Partition output by frequently filtered columns
df.write.partitionBy("date", "region").parquet("output/")

# Bucket by a join key to reduce shuffles (bucketing requires saveAsTable)
df.write.bucketBy(10, "customer_id").mode("overwrite").saveAsTable("output_bucketed")

# Compress output files (snappy for Parquet)
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Right-size executor resources when submitting the job
spark-submit --executor-memory 8g --executor-cores 4 job.py
Real-World Example: Daily E-commerce Batch Job
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count
from datetime import datetime, timedelta

spark = (
    SparkSession.builder
    .appName("DailyEcommerceBatch")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# Configuration
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
input_path = f"s3://ecommerce-raw/transactions/{yesterday}/"
output_path = f"s3://ecommerce-processed/daily_summary/{yesterday}/"

try:
    # Extract: Read raw transactions
    df_transactions = spark.read.parquet(input_path)
    print(f"Loaded {df_transactions.count()} transactions")

    # Transform: Filter and enrich
    df_valid = df_transactions.filter(
        (col("status") == "completed") &
        (col("amount") > 0)
    )

    # Aggregate: Calculate daily metrics
    daily_metrics = (
        df_valid.groupBy("product_category", "region")
        .agg(
            count("transaction_id").alias("transaction_count"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value"),
        )
    )

    # Calculate percentage of total revenue
    total_revenue = daily_metrics.agg({"total_revenue": "sum"}).collect()[0][0]
    daily_metrics = daily_metrics.withColumn(
        "pct_of_total",
        (col("total_revenue") / total_revenue) * 100
    )

    # Load: Write to processed zone
    daily_metrics.write.mode("overwrite").parquet(output_path)

    # Cleanup: Archive raw data
    print(f"✅ Batch job completed. Results in {output_path}")

except Exception as e:
    print(f"❌ Batch job failed: {e}")
    # Send alert to ops team
    raise
finally:
    spark.stop()
Batch Processing vs Real-Time
| Aspect | Batch | Real-Time |
|---|---|---|
| Latency | Hours/Days | Milliseconds |
| Data Volume | Large volumes | Continuous streams |
| Complexity | Complex transformations | Simple operations |
| Cost | Lower | Higher |
| Technology | Spark, Hadoop | Kafka, Flink |
| Use Cases | Reporting, Analytics | Alerts, Monitoring |
Best Practices
• Idempotency: Jobs should produce the same results when rerun (see the sketch after this list)
• Monitoring: Track job success/failure, duration, resources
• Data Quality: Validate data before and after processing
• Error Handling: Graceful failures with rollback capability
• Logging: Comprehensive logging for debugging
• Scheduling: Account for job duration in schedule gaps
• Retention: Define data retention policies
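For the idempotency point above, the sketch below shows one common rerun-safe pattern: overwriting only the partition for the run date, so a rerun replaces that day's output instead of duplicating it. It assumes the input already carries a date column and that dynamic partition overwrite is available; paths and names are hypothetical.

# Minimal sketch of an idempotent daily job; paths and the date column are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IdempotentDailyJob")
    # Overwrite only the partitions written by this run, not the whole dataset
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

run_date = "2024-01-01"  # normally injected by the scheduler

df = spark.read.parquet(f"s3://bucket/raw-data/{run_date}/")

(df.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://bucket/processed-data/"))

spark.stop()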
💡 Key Insights
• Batch processing dominates the enterprise: the large majority of data processing workloads are still batch
• Cost-effective at scale: Process millions of rows for cents
• Mature technology: Well-established best practices
• Excellent for reporting: Perfect for scheduled analytics
• Trade-off: Latency vs cost and simplicity
📚 Study Notes
• Batch processing: Data collected and processed together
• Scheduled: Runs at predetermined times
• Large-scale: Optimized for big volumes
• Cost-effective: Efficient resource utilization
• Common tools: Spark, Hadoop, Hive
• Use cases: Reporting, warehousing, migrations
• Scheduling: Cron, Airflow, event-driven