Example: Spark Job Execution

 

Understanding Spark becomes much easier when you see it in action. Let’s walk through a simple PySpark example and track how it becomes a Spark Job that is split into Stages and Tasks.

 

Example: Code That Triggers a Spark Job
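
Below is a minimal PySpark sketch consistent with the operations referenced throughout this section: a CSV read, a filter on age > 30, a groupBy("city").count(), and show(). The file name people.csv, the explicit schema, and the SparkSession setup are illustrative assumptions rather than the original listing.

```python
from pyspark.sql import SparkSession

# Assumed setup: any existing SparkSession works; the app name is arbitrary.
spark = SparkSession.builder.appName("JobExecutionExample").getOrCreate()

# 1. Read a CSV file. With an explicit schema the read stays fully lazy
#    (asking Spark to infer the schema would trigger a small job of its own).
#    "people.csv" is a hypothetical file with name, age, and city columns.
df = (
    spark.read
    .option("header", True)
    .schema("name STRING, age INT, city STRING")
    .csv("people.csv")
)

# 2. Narrow transformation: only recorded in the DAG, nothing executes yet.
adults = df.filter(df["age"] > 30)

# 3. Wide transformation: still lazy, but it will require a shuffle when executed.
city_counts = adults.groupBy("city").count()

# 4. Action: here Spark creates a Job, splits it into Stages at the
#    shuffle boundary, and runs Tasks on the Executors.
city_counts.show()
```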

Breakdown of Execution Flow

  • Reading CSV (df = spark.read.csv(…)): no job is triggered yet (lazy evaluation).
  • Transformations (filter(), groupBy()): still no job is triggered; Spark only records them in a DAG (a quick way to confirm this is sketched after this list).
  • Action (show()): Spark creates a Job and executes it.
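
To confirm the lazy behaviour described above, you can ask Spark to print the plan it has recorded without running anything. A small sketch, reusing the city_counts DataFrame from the example above:

```python
# Nothing has executed yet: filter() and groupBy().count() only extended the plan.
# explain() prints the physical plan Spark would run; it does not trigger a job.
city_counts.explain()

# The printed plan contains an Exchange (shuffle) operator for the groupBy,
# which is exactly where Spark will split the job into two stages.
```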

    Step               | What Happens                           | Execution Type
    -------------------|----------------------------------------|--------------------------
    read.csv()         | Spark loads the schema lazily          | No execution yet
    filter()           | Transformation is recorded in the DAG  | Narrow transformation
    groupBy().count()  | Adds a wide transformation             | Shuffle required
    show()             | Triggers a Job                         | Action: execution begins

    [Image: Spark job execution flow (11_example_spark_job_execution_image_1.png)]

     

    Component | Definition                                                                                                     | Example
    ----------|----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------
    Job       | A high-level execution triggered when an action (like show(), collect(), count()) is called.                  | df.show() creates a Job
    Stage     | A set of tasks that can be executed in parallel; a Job is divided into multiple Stages based on its transformations. | filter() (Stage 1) → groupBy() (Stage 2)
    Task      | The smallest unit of execution; each task processes one data partition.                                       | Task 1 processes Partition 1, Task 2 processes Partition 2
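
The Job → Stage → Task hierarchy in the table above can also be observed programmatically. The sketch below uses PySpark's status tracker; the job-group name "demo" and the reuse of df from the earlier example are assumptions for illustration.

```python
# Run one action inside a named job group so its jobs can be looked up afterwards.
sc = spark.sparkContext
sc.setJobGroup("demo", "groupBy + count example")
df.filter(df["age"] > 30).groupBy("city").count().show()

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("demo"):
    job = tracker.getJobInfo(job_id)
    if job is None:           # info for old jobs may already be gone
        continue
    print(f"Job {job_id}: stages {job.stageIds}")
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)
        if stage is not None:
            print(f"  Stage {stage_id}: {stage.numTasks} tasks")
```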

     

    How Spark Sees This Code

  • The Driver builds a DAG: filter() → groupBy()
  • Stage 1 (Narrow): executes filter(df["age"] > 30)
  • Stage 2 (Wide): executes groupBy("city"), causing a shuffle
  • Tasks: created for each partition and sent to Executors (a quick partition check is sketched after this list)
  • Results: collected by the Driver and displayed via .show()
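
Because one Task is created per partition, the number of Tasks in Stage 1 is simply the partition count of the DataFrame, while the task count of the post-shuffle stage follows the shuffle-partition setting. A quick check, reusing df from the example above:

```python
# Tasks in Stage 1 = number of partitions backing the DataFrame.
print(df.rdd.getNumPartitions())

# Tasks in the post-shuffle stage (Stage 2) are governed by this setting (default: 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```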

    Execution Step            | Type                                    | Effect
    --------------------------|-----------------------------------------|----------------------------------------
    filter(df["age"] > 30)    | Narrow transformation                   | Stays within Stage 1
    groupBy("city").count()   | Wide transformation (shuffle required)  | A new Stage 2 is created
    show()                    | Action (triggers a Job)                 | Spark creates a Job
    Tasks                     | Executed on Worker Nodes                | Each task processes a data partition

    💡 Did You Know?

    •   The number of Tasks created equals the number of data partitions. You can control partitioning with .repartition() or .coalesce() (see the sketch below).

    •   Wide transformations (like groupBy()) require shuffles; these are expensive and should be minimized when possible.
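
As a rough illustration of the first point, the snippet below changes the partition count, and with it the number of tasks a stage would get; the target values 8 and 2 are arbitrary.

```python
# repartition() can increase or decrease the partition count, at the cost of a full shuffle.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())   # 8

# coalesce() merges existing partitions without a shuffle, so it can only reduce the count.
df2 = df8.coalesce(2)
print(df2.rdd.getNumPartitions())   # 2
```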

     

    📚 Study Notes

    •   Driver → DAG → Stages → Tasks → Executors

    •   Transformations (like filter(), groupBy()) are lazy and only recorded.

    •   Actions (like show(), collect()) trigger the actual job execution.

    •   Spark splits the job into Stages at shuffle boundaries.

    •   Executors run Tasks in parallel on partitions across Worker Nodes.

    •   You can track this entire process visually in the Spark Web UI (see the snippet below for how to find its URL).
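
For the last point, the Web UI of a locally running application is usually served at http://localhost:4040; the one-liner below simply asks the running SparkContext where its UI lives.

```python
# URL of the Spark Web UI for this application (None if the UI is disabled).
print(spark.sparkContext.uiWebUrl)
```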

     

     

     
