Spark Application Lifecycle

 

The lifecycle of a Spark application encompasses seven distinct phases: submission, resource negotiation, planning and optimization, DAG scheduling, task execution, shuffle, and result collection with termination. Understanding this lifecycle is critical for debugging, optimizing, and predicting application behavior: from the moment you submit code to when results appear, a complex orchestration happens behind the scenes.

 

The Seven Phases of a Spark Application Lifecycle

Phase 1: Application Submission

What happens: You execute the spark-submit command:

spark-submit --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  my_spark_app.py

Behind the scenes:

  • Spark launcher starts on your machine
  • Reads your Python/Scala/Java code
  • Initializes SparkContext or SparkSession
  • Submits application to Cluster Manager
  • Timeline: 1-5 seconds

    Key components involved: Spark launcher, Cluster Manager
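 

For reference, here is a minimal sketch of what my_spark_app.py might contain (illustrative only; the app name and input path are assumptions, not part of the example above):

from pyspark.sql import SparkSession

# Creating the SparkSession initializes the SparkContext, which connects to the
# Cluster Manager and requests executors (Phase 2)
spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

df = spark.read.parquet("data.parquet")   # assumed input path
df.show(5)

spark.stop()   # tells the Cluster Manager the application is finished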

     

    Phase 2: Resource Negotiation

What happens: The Cluster Manager allocates machines for your executors.

Driver: "I need 10 executors with 4 cores and 4GB each"
        ↓
Cluster Manager: "Checking availability…"
                 "Yes, I have 10 suitable machines"
                 "Launching executors…"
        ↓
Worker Nodes: "Executor JVMs starting…"

    Behind the scenes:

  • Cluster Manager checks available resources
  • Allocates machines for Driver and Executors
  • Launches Driver JVM on one machine
  • Launches Executor JVMs on worker machines
  • Executors register with Driver
  • Timeline: 30-120 seconds (Kubernetes: 10-30s, YARN: 30-60s)

    Key components involved: Cluster Manager, Driver, Executors
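 

The same resource request can also be expressed as configuration properties instead of spark-submit flags; a sketch using standard Spark property names (the app name is an assumption):

from pyspark.sql import SparkSession

# Equivalent of the spark-submit flags from Phase 1, expressed as config properties
spark = (SparkSession.builder
         .appName("my_spark_app")
         .config("spark.executor.instances", "10")   # --num-executors 10
         .config("spark.executor.cores", "4")        # --executor-cores 4
         .config("spark.executor.memory", "4g")      # --executor-memory 4g
         .getOrCreate())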

     

    Phase 3: Planning and Optimization

What happens: Your code is analyzed and converted into an execution plan.

Your code:

df = spark.read.parquet("data.parquet")
result = df.filter(df.age > 25).groupBy("city").count()
result.write.parquet("output")

     

Step 1: Parse into Logical Plan
├─ Read parquet
├─ Filter age > 25
├─ GroupBy city
├─ Count
└─ Write parquet

Step 2: Optimize with Catalyst
├─ Reorder operations
├─ Push filters down (read only age > 25 records)
├─ Combine narrow transformations
└─ Choose best algorithms

Step 3: Create Physical Plan
├─ Read 100 partitions in parallel
├─ Filter in map phase
├─ Shuffle for groupBy
├─ Aggregate to get count
└─ Write results

     

     

    Behind the scenes:

  • Driver parses your code
  • Catalyst optimizer reorders operations
  • Chooses execution strategy (sort-based shuffle vs hash-based)
  • Determines memory requirements
  • Plans data locality
  • Timeline: 1-10 seconds (usually negligible)

    Key components involved: Driver, Catalyst Optimizer
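 

You can see the plans Catalyst produces for the example above with explain(); a short sketch (the input path is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan_inspection").getOrCreate()

df = spark.read.parquet("data.parquet")
result = df.filter(df.age > 25).groupBy("city").count()

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
# In the physical plan's Parquet scan node, PushedFilters shows the age > 25
# predicate that Catalyst pushed down to the reader.
result.explain(mode="extended")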

     

    Phase 4: DAG Scheduling and Task Creation

What happens: Spark breaks the physical plan into stages, and the stages into tasks.

Physical Plan
    ↓
DAG (Directed Acyclic Graph)
    ├─ Stage 0: Read + Filter (no shuffle needed)
    │  ├─ Task 0 on Executor 1: Process partition 0
    │  ├─ Task 1 on Executor 2: Process partition 1
    │  ├─ Task 2 on Executor 3: Process partition 2
    │  └─ (100 tasks total, one per partition)
    │
    ├─ [SHUFFLE BARRIER]
    │  └─ Shuffle data across the network
    │
    └─ Stage 1: Aggregate (post-shuffle)
       ├─ Task 100 on Executor 1: Aggregate group 0
       ├─ Task 101 on Executor 2: Aggregate group 1
       └─ Task 102 on Executor 3: Aggregate group 2
           (200 tasks total, one per output partition)

     

     

    Behind the scenes:

  • DAGScheduler analyzes dependencies
  • Groups tasks that don't require data movement (stages)
  • Identifies shuffle boundaries (stage separators)
  • Creates task objects with code and data references
  • Builds task queue for executors
  • Timeline: 1-5 seconds

    Key components involved: DAGScheduler , Driver
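 

The task counts above are not arbitrary: Stage 0 gets one task per input partition, while the post-shuffle stage gets one task per shuffle partition. A small illustration, continuing the Phase 3 sketch (spark.sql.shuffle.partitions is a standard setting whose default is 200):

# spark.sql.shuffle.partitions controls how many partitions (and therefore tasks)
# the post-shuffle stage has; 200 is Spark's default, matching Stage 1 above
spark.conf.set("spark.sql.shuffle.partitions", "200")

result = df.filter(df.age > 25).groupBy("city").count()
print(result.rdd.getNumPartitions())   # 200, unless adaptive execution coalesces them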

     

    Phase 5: Task Execution

What happens: Tasks execute in parallel on the executors.

Driver task queue:
[Task 0, Task 1, Task 2, …, Task 299]

Executor 1:  Processing [Task 0, Task 1, Task 2]
Executor 2:  Processing [Task 3, Task 4, Task 5]
Executor 3:  Processing [Task 6, Task 7, Task 8]
…
Executor 10: Processing [Task 27, Task 28, Task 29]

(all executing in parallel)

     

     

    Behind the scenes:

  • Driver assigns tasks to available executors
  • Each executor maintains a task queue
  • Executor deserializes task code
  • Loads partition from storage
  • Executes task on partition
  • Stores results (in memory or cache)
  • Timeline: Depends on data size and complexity (seconds to hours)

    Key components involved: Driver, Executors, Storage System

    Typical performance: 100MB/second per executor core (ranges from 10MB-1000MB/s)
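 

Parallelism in this phase is driven by partitioning: each partition becomes one task, and each executor core runs one task at a time. A short sketch of checking and adjusting this, continuing the same example (the partition count is illustrative):

# One task per partition; with 10 executors x 4 cores, 40 tasks run at any moment
print(df.rdd.getNumPartitions())   # e.g. 100 partitions -> 100 tasks in the stage

# If there are too few partitions to keep all cores busy, repartitioning can help,
# but note that repartition() itself triggers a shuffle
df = df.repartition(400)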

     

Phase 6: Shuffle (If Needed)

What happens: If your operation requires data movement (groupBy, join), a shuffle occurs between stages.

After Stage 0 execution:

Executor 1: [city="NYC": rows 1-1000,    city="LA": rows 1001-2000]
Executor 2: [city="NYC": rows 2001-3000, city="LA": rows 3001-4000]
Executor 3: [city="NYC": rows 4001-5000, city="LA": rows 5001-6000]

[SHUFFLE: Reorganize data]

After shuffle:

Executor 1: [city="NYC": all rows]  (ready for aggregation)
Executor 2: [city="LA":  all rows]  (ready for aggregation)
Executor 3: [city="SF":  all rows]  (ready for aggregation)

Now Stage 1 can execute: the data is already grouped by city!

     

     

    Behind the scenes:

  • Shuffle write: Executors write their data to intermediate files
  • Shuffle files are sorted by key
  • Network traffic: All executor machines send to all executor machines
  • Shuffle read: Next stage reads shuffled data
  • Executors load appropriate shuffle files
  • Timeline: Can be very long for large operations (seconds to hours)

    Performance: Shuffle is typically the bottleneck. Usually 10-50MB/second per node.
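 

One common way to avoid a shuffle entirely is a broadcast join, which works when one side of a join is small. A hedged sketch (the DataFrame names are placeholders, not from the example above):

from pyspark.sql import functions as F

# The small reference table is copied to every executor, so the large table is
# joined in place instead of being shuffled across the network
joined = large_df.join(F.broadcast(small_ref_df), "join_key")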

     

    Phase 7: Result Collection and Application Termination

    What happens:Results are collected, written to storage, and the application shuts down.

    Executors Complete Final Tasks

        ↓

    Results stored (either in output storage or Driver memory)

        ↓

    Driver reports success

        ↓

    Cluster Manager receives "application complete" signal

        ↓

    Cluster Manager kills Executor JVMs

        ↓

    Machines freed and available for other applications

        ↓

    Driver process terminates

     

     

    Behind the scenes:

  • Last stage completes
  • Results written to output location (S3, HDFS, database)
  • Driver collects any remaining results
  • Application logs written
  • Cluster Manager notified
  • Resources released
  • Application terminates
  • Timeline: 1-30 seconds

    Key components involved: Driver, Executors, Cluster Manager, Storage
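 

A minimal sketch of a safe finish, continuing the same example: write results to distributed storage rather than collecting them into Driver memory, then stop the session (the output path is an assumption):

# Writing to distributed storage keeps results off the Driver;
# collect() would pull everything into Driver memory and can cause an OOM
result.write.mode("overwrite").parquet("s3://my-bucket/output/")   # hypothetical path

spark.stop()   # signals the Cluster Manager to release the executors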

     

    Complete Timeline Example: A Real Spark Job

Let's trace a realistic Spark job from submission to shutdown:

T+0:00   User runs: spark-submit --num-executors 50 my_job.py

T+0:05   Cluster Manager allocates 50 machines
         Executor JVMs starting on worker nodes

T+0:30   Executors registered with Driver

T+0:31   Driver parses code
         Catalyst optimizes plan

T+0:32   DAG created: 3 stages total
           Stage 0: Read CSV + Filter (400 tasks)
           Stage 1: Join with reference data (800 tasks)
           Stage 2: Aggregate and write (200 tasks)

T+0:33   Stage 0 begins: Read + Filter
         50 executors x 8 cores = 400 parallel tasks
         Processing a 100GB CSV file at 50MB/s per core

         Total throughput: 50 executors x 8 cores x 50MB/s = 20GB/s
         100GB ÷ 20GB/s = 5 seconds of reading

T+0:40   Stage 0 complete (400 tasks finished)
         Results ready for shuffle

T+0:42   Shuffle phase
         50 executors reorganizing data by join key
         ~50GB of data moving across the network
         Network bandwidth: 1GB/s per executor
         Total: 50 executors x 1GB/s = 50GB/s
         50GB ÷ 50GB/s = 1 second (network bound)

         But shuffle also writes and reads intermediate files:
         file I/O is slower, so the shuffle takes ~5 minutes total

T+5:47   Stage 1 begins: Join
         Data already grouped by key from the shuffle
         50 executors joining in parallel
         Shuffle output: 50GB
         Processing at 100MB/s per executor
         50 executors x 100MB/s = 5GB/s
         50GB ÷ 5GB/s = 10 seconds

T+5:58   Stage 1 complete

T+6:00   Stage 2 begins: Aggregate + Write
         Aggregating joined data
         Writing results to S3

         Output: 10GB
         Write speed: 50MB/s per executor
         50 executors x 50MB/s = 2.5GB/s
         10GB ÷ 2.5GB/s = 4 seconds

T+6:05   All stages complete

T+6:10   Executors killed by Cluster Manager
         Resources released

Total job time: 6 minutes 10 seconds

     

     

     

    Monitoring the Lifecycle: Spark UI

    Spark provides a Web UI showing the lifecycle in real-time:

http://<driver-hostname>:4040

     

Tabs available:
├─ Jobs: Shows each job (the action that triggered execution)
├─ Stages: Shows each stage and task performance
├─ Storage: Shows cached data in executors
├─ Environment: Shows configuration
└─ Executors: Shows executor health and metrics

     

     

    Each tab shows:

  • Execution progress (%)
  • Task completion times
  • Shuffle read/write volumes
  • GC time on executors
  • Memory usage
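 

The UI at port 4040 disappears when the application ends. To keep this history for finished jobs, the event log can be enabled so a History Server can replay it; a sketch using standard Spark properties (the log directory is a placeholder):

from pyspark.sql import SparkSession

# Event logging must be enabled before the SparkContext starts
spark = (SparkSession.builder
         .appName("my_spark_app")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-event-logs")   # placeholder path
         .getOrCreate())

# A separately running History Server (default port 18080) reads the same directory
# via its spark.history.fs.logDirectory setting and serves the UI for completed jobs.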

    Common Bottlenecks by Phase

Phase                  Bottleneck          Symptom                        Fix
Resource Negotiation   Cluster busy        Long wait before tasks start   Reduce executor count or wait
Planning               Complex query       Takes minutes to optimize      Simplify the query structure
Task Execution         Slow I/O            Tasks take a long time         Increase parallelism or tune storage
Shuffle                Network bandwidth   Shuffle takes hours            Reduce shuffle volume or partition better
Result Collection      Driver memory       Driver runs out of memory      Avoid collect(), write to storage

     

    💡 Did You Know?

    •   Cold start: First Spark job on a cluster is 5-10x slower (JVM warm-up, JIT compilation)

•   Shuffle can be 80% of job time in shuffle-heavy jobs: Optimizing shuffle is often the most impactful optimization

    •   Spark UI is the best debugging tool: 90% of issues visible in Spark UI

    •   Dynamic allocation: Executors can be added/removed during job execution

    •   Speculative execution: Spark can re-run slow tasks on other executors to speed up stragglers
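 

Both dynamic allocation and speculative execution are controlled by standard configuration properties; a hedged sketch of turning them on (the values and app name are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elastic_job")
         # Re-run unusually slow tasks on another executor and keep whichever finishes first
         .config("spark.speculation", "true")
         # Let Spark grow and shrink the executor pool while the job runs
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "50")
         # On YARN, dynamic allocation typically also needs the external shuffle service
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())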

     

     

    📚 Study Notes

    •   7 phases: Submission → Resource Negotiation → Planning → DAG → Execution → Shuffle → Termination

    •   Submission: spark-submit command launches application

    •   Resource negotiation: Cluster Manager allocates machines

    •   Planning: Catalyst optimizes logical plan to physical plan

    •   DAG creation: Physical plan broken into stages and tasks

    •   Execution: Tasks run in parallel on executors

    •   Shuffle: Data reorganized between stages (slow phase)

    •   Termination: Results written, executors killed, resources freed

    •   Typical job time distribution: 5% planning, 60% execution, 30% shuffle, 5% overhead

    •   Bottlenecks: Shuffle, network, I/O, or task execution

    •   Monitoring: Spark UI shows real-time progress for each phase

     

     
