Batch Processing

 

Batch processing is the foundational data processing paradigm in which large volumes of data are collected, stored, and processed together in discrete groups, or batches. It remains the dominant approach for most data analytics, reporting, and data warehousing workloads because it offers simplicity, scalability, and cost-effectiveness. This topic explores how batch processing works, when to use it, and real-world applications.

 

What is Batch Processing?

Definition and Core Concept

Batch processing is a computing method where:

  • Data is collected over time into groups (batches)
  • Batches are processed together as a single unit
  • Processing happens at predetermined times or when a batch size threshold is reached
  • Results are generated only after the entire batch has been processed (see the sketch below)
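To make the collect-then-process idea concrete, here is a minimal, hypothetical Python sketch of a batch accumulator that flushes either when a size threshold or a time limit is reached. The BatchAccumulator class, its thresholds, and process_batch are illustrative names, not part of any specific library.

import time

def process_batch(records):
    # Placeholder for the real work: transformations, aggregations, loading results
    print(f"Processing batch of {len(records)} records")

class BatchAccumulator:
    """Collects records and processes them as one unit when a size
    or time threshold is reached (illustrative only)."""

    def __init__(self, max_size=1000, max_wait_seconds=3600):
        self.max_size = max_size                  # flush once this many records accumulate
        self.max_wait_seconds = max_wait_seconds  # ...or once this much time has passed
        self.buffer = []
        self.last_flush = time.time()

    def add(self, record):
        self.buffer.append(record)
        too_big = len(self.buffer) >= self.max_size
        too_old = time.time() - self.last_flush >= self.max_wait_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            process_batch(self.buffer)    # the whole group is processed together
            self.buffer = []
        self.last_flush = time.time()

# Usage: collect records as they arrive, then flush whatever remains at the end
acc = BatchAccumulator(max_size=3)
for i in range(7):
    acc.add({"id": i})
acc.flush()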
Key Characteristics

Characteristic | Description
Timing | Scheduled or event-triggered (not real-time)
Data Volume | Large volumes processed together
Latency | Hours to days is acceptable
Complexity | Can handle complex transformations
Cost | Very efficient for large volumes
Infrastructure | Simpler, proven technology

     

How Batch Processing Works

Basic Workflow

Data Collection
    ↓
Store in Storage System
    ↓
Scheduled Batch Job Trigger
    ↓
Load Data into Memory
    ↓
Process/Transform Data
    ↓
Load Results to Destination
    ↓
Notify Stakeholders
    ↓
Clean Up/Archive
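The same workflow can be expressed as a small driver script. The sketch below is illustrative only: each helper is a stand-in for the corresponding stage above, not a real framework API.

def extract_data(batch_date):
    # Data Collection + Load into memory: in practice, read the day's files
    # from HDFS/S3 or query a source database
    return [{"id": 1, "amount": 10.0, "date": batch_date},
            {"id": 2, "amount": -3.0, "date": batch_date}]

def transform_data(records):
    # Process/Transform: cleaning, joins, and aggregations over the whole batch
    return [r for r in records if r["amount"] > 0]

def load_results(records, batch_date):
    # Load Results to Destination: in practice, write to a warehouse or processed zone
    print(f"Wrote {len(records)} rows for {batch_date}")

def run_batch_job(batch_date):
    raw = extract_data(batch_date)
    processed = transform_data(raw)
    load_results(processed, batch_date)
    print(f"Batch for {batch_date} completed")   # Notify Stakeholders (simplified)
    # Clean up/Archive would go here

run_batch_job("2024-01-01")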

     

     

     

Practical Example: Daily Sales Report

Day 1:

Morning to evening: Sales occur throughout the day

Evening: All transactions are stored in the database

Day 2:

2:00 AM: Batch job starts

2:15 AM: Load all of yesterday's transactions

2:30 AM: Calculate totals, aggregations, and insights

2:45 AM: Generate reports and send emails

3:00 AM: Archive data and clean up

Result: Sales report ready by 9:00 AM business hours
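One practical detail in a job like this is deriving the reporting window ("all of yesterday's transactions"). A minimal sketch, assuming the job runs shortly after midnight and filters on a transaction timestamp column:

from datetime import datetime, timedelta

# Yesterday's date and its [start, end) boundaries for filtering transactions
run_time = datetime.now()                                        # e.g. Day 2, 2:00 AM
yesterday = (run_time - timedelta(days=1)).date()
window_start = datetime.combine(yesterday, datetime.min.time())  # Day 1, 00:00:00
window_end = window_start + timedelta(days=1)                    # Day 2, 00:00:00 (exclusive)

print(f"Report covers {window_start} to {window_end}")
# A query would then filter:
#   WHERE transaction_time >= window_start AND transaction_time < window_end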

     

     

     

Batch Processing Tools and Technologies

Traditional Batch Processing

MapReduce → Spark → Flink Batch
(evolution of large-scale distributed batch processing)

Tools:

  • Apache Spark (most popular)
  • Hadoop MapReduce
  • Apache Hive
  • Presto / Trino
  • Dask (Python)

     

     

     

Common Batch Processing Patterns

Pattern 1: ETL (Extract, Transform, Load)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Extract: Read source data
df_raw = spark.read.parquet("s3://bucket/raw-data/2024-01-01/")

# Transform: Clean and process
df_clean = (
    df_raw
    .filter(df_raw.status == "completed")
    .withColumn("amount_usd", df_raw.amount * 1.1)
)

# Load: Write to destination
df_clean.write.parquet("s3://bucket/processed-data/2024-01-01/")

spark.stop()

     

     

Pattern 2: Aggregation and Reporting

from pyspark.sql.functions import sum, avg

# Calculate daily sales summary
daily_summary = (
    df.groupBy("date", "product")
      .agg(
          sum("quantity").alias("total_quantity"),
          sum("revenue").alias("total_revenue"),
          avg("price").alias("avg_price")
      )
)

daily_summary.write.mode("overwrite").parquet("output/daily_summary")

     

     

Pattern 3: Data Deduplication

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Remove duplicates, keeping the latest record per customer
window = Window.partitionBy("customer_id").orderBy(col("timestamp").desc())
df_deduplicated = (
    df.withColumn("row_num", row_number().over(window))
      .filter(col("row_num") == 1)
      .drop("row_num")
)

     

     

     

When to Use Batch Processing

Ideal Use Cases

  • Data Warehousing: Load daily snapshots
  • Reporting: Generate scheduled reports
  • Data Migration: Move large datasets
  • Aggregations: Calculate summaries
  • Machine Learning: Train models on historical data
  • Data Validation: Quality checks on large volumes
  • Backups: Archive and backup operations
  • Cost-Sensitive Workloads: Process during off-peak hours

When NOT to Use

  • Real-time Requirements: Immediate results are needed
  • Interactive Analysis: Exploratory queries
  • Streaming Events: Continuous data flow
  • Low Latency: Sub-second response needed

Batch Processing Architecture

Traditional Architecture

Data Sources (Databases, Files, APIs)
    ↓
Batch Storage (HDFS, S3, Data Lake)
    ↓
Batch Processing Engine (Spark, Hadoop, Flink)
    ↓
Processing Results
    ↓
Data Warehouse / Data Mart
    ↓
BI Tools / Reports / Dashboards
    ↓
End Users
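As a rough illustration of how these layers connect in code, here is a hedged PySpark sketch that reads a batch from the storage layer, processes it, and loads the result into a warehouse table over JDBC. The bucket path, table name, and connection settings are placeholders, not real endpoints, and the JDBC write assumes the appropriate driver is on the Spark classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName("BatchArchitectureSketch").getOrCreate()

# Batch Storage layer: read raw files from the data lake (placeholder path)
orders = spark.read.parquet("s3://example-data-lake/orders/2024-01-01/")

# Processing Engine layer: aggregate the batch
summary = orders.groupBy("region").agg(sum("amount").alias("total_amount"))

# Warehouse layer: load results via JDBC (placeholder connection details)
(
    summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")
    .option("dbtable", "daily_region_summary")
    .option("user", "etl_user")
    .option("password", "replace-me")
    .mode("overwrite")
    .save()
)

spark.stop()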

     

Scheduling Batch Jobs

Common Scheduling Methods

1. Time-Based (Cron)

# Daily at 2:00 AM
0 2 * * * /opt/spark/bin/spark-submit batch_job.py

# Every 6 hours
0 */6 * * * /opt/spark/bin/spark-submit batch_job.py

# Weekly on Monday at 1:00 AM
0 1 * * 1 /opt/spark/bin/spark-submit batch_job.py

2. Workflow Orchestration (Airflow)

from airflow import DAG
# SparkSubmitOperator ships with the apache-airflow-providers-apache-spark package
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime

default_args = {
    'owner': 'data_team',
    'start_date': datetime(2024, 1, 1),
}

dag = DAG('daily_batch_job', default_args=default_args, schedule_interval='0 2 * * *')

spark_job = SparkSubmitOperator(
    task_id='run_batch_process',
    application='/opt/spark/jobs/process_data.py',
    conf={'spark.executor.memory': '8g'},
    dag=dag
)

3. Event-Driven

# Process when a file arrives in S3 (AWS Lambda handler)
import os
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Triggered when a new file lands in S3
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Launch Spark job against the new object
    os.system(f"spark-submit process.py {bucket}/{key}")

     

     

     

Batch Processing Performance Considerations

Optimization Strategies

  • Partitioning: Divide data by date, region, or category

    df.write.partitionBy("date", "region").parquet("output/")

  • Bucketing: Pre-shuffle data into a fixed number of buckets on a join or group key (Spark only supports bucketing when saving as a table)

    df.write.bucketBy(10, "customer_id").sortBy("customer_id").mode("overwrite").saveAsTable("customers_bucketed")

  • Compression: Reduce storage and I/O (e.g. Snappy for Parquet output)

    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

  • Resource Allocation: Tune executor memory and cores

    spark-submit --executor-memory 8g --executor-cores 4 job.py

     

     

     

Real-World Example: Daily E-commerce Batch Job

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count
from datetime import datetime, timedelta

spark = (
    SparkSession.builder
    .appName("DailyEcommerceBatch")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# Configuration
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
input_path = f"s3://ecommerce-raw/transactions/{yesterday}/"
output_path = f"s3://ecommerce-processed/daily_summary/{yesterday}/"

try:
    # Extract: Read raw transactions
    df_transactions = spark.read.parquet(input_path)

    print(f"Loaded {df_transactions.count()} transactions")

    # Transform: Filter and enrich
    df_valid = df_transactions.filter(
        (col("status") == "completed") &
        (col("amount") > 0)
    )

    # Aggregate: Calculate daily metrics
    daily_metrics = (
        df_valid.groupBy("product_category", "region")
        .agg(
            count("transaction_id").alias("transaction_count"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )

    # Calculate each group's percentage of total revenue
    total_revenue = daily_metrics.agg({"total_revenue": "sum"}).collect()[0][0]
    daily_metrics = daily_metrics.withColumn(
        "pct_of_total",
        (col("total_revenue") / total_revenue) * 100
    )

    # Load: Write to processed zone
    daily_metrics.write.mode("overwrite").parquet(output_path)

    # Cleanup: Archive raw data
    print(f"Batch job completed. Results in {output_path}")

except Exception as e:
    print(f"Batch job failed: {e}")
    # Send alert to ops team
    raise

finally:
    spark.stop()

     

     

     

Batch Processing vs Real-Time

Aspect | Batch | Real-Time
Latency | Hours to days | Milliseconds
Data Volume | Large volumes | Continuous streams
Complexity | Complex transformations | Simple operations
Cost | Lower | Higher
Technology | Spark, Hadoop | Kafka, Flink
Use Cases | Reporting, analytics | Alerts, monitoring

Best Practices

  • Idempotency: Jobs should produce the same results when rerun (see the sketch after this list)
  • Monitoring: Track job success/failure, duration, and resource usage
  • Data Quality: Validate data before and after processing
  • Error Handling: Fail gracefully, with rollback capability
  • Logging: Comprehensive logging for debugging
  • Scheduling: Leave enough headroom between runs for the job's actual duration
  • Retention: Define data retention policies
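A common way to achieve idempotency is to give each run a deterministic output location, one partition per batch date, and overwrite it, so a rerun replaces its own results instead of duplicating them. A minimal sketch, reusing the daily_metrics DataFrame, the yesterday date string, and the spark session from the e-commerce example above (the output path is a placeholder):

from pyspark.sql.functions import lit

# Only overwrite the partitions this run actually writes, not the whole dataset
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    daily_metrics
    .withColumn("batch_date", lit(yesterday))   # tag every row with its batch date
    .write
    .mode("overwrite")                          # reruns replace the same partition, no duplicates
    .partitionBy("batch_date")
    .parquet("s3://ecommerce-processed/daily_summary/")
)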

💡 Key Insights

  • Batch processing still dominates in the enterprise: the large majority of data processing workloads are batch
  • Cost-effective at scale: process millions of rows for cents
  • Mature technology: well-established tooling and best practices
  • Excellent for reporting: a natural fit for scheduled analytics
  • Trade-off: latency versus cost and simplicity

     

     

📚 Study Notes

  • Batch processing: data is collected and processed together
  • Scheduled: runs at predetermined times
  • Large-scale: optimized for big volumes
  • Cost-effective: efficient resource utilization
  • Common tools: Spark, Hadoop, Hive
  • Use cases: reporting, warehousing, migrations
  • Scheduling: cron, Airflow, event-driven
