Spark Application Lifecycle
The lifecycle of a Spark application encompasses seven distinct phases: submission, resource negotiation, planning, DAG scheduling, task execution, shuffle, and termination. From the moment you submit code to the moment results appear, a complex orchestration happens behind the scenes, and understanding it is critical for debugging, optimizing, and predicting application behavior.
The Seven Phases of a Spark Application Lifecycle
Phase 1: Application Submission
What happens: You execute the spark-submit command:
spark-submit --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  my_spark_app.py
Behind the scenes:
Timeline: 1-5 seconds
Key components involved: Spark launcher, Cluster Manager
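If you create the SparkSession yourself (for example in a notebook), the same resources can be requested programmatically. A minimal sketch, assuming a reachable YARN cluster; the application name and resource values are illustrative:
from pyspark.sql import SparkSession

# Equivalent of the spark-submit flags above, set at session creation.
# Master URL, app name, and resource sizes are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("my_spark_app")
    .master("yarn")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)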
Phase 2: Resource Negotiation
What happens: The Cluster Manager allocates machines for your executors.
Driver : "I need 10 executors with 4 cores and 4GB each"
↓
Cluster Manager : "Checking availability…"
"Yes, I have 10 suitable machines"
"Launching executors…"
↓
Worker Nodes : "Executor JVMs starting…"
Behind the scenes:
Timeline: 30-120 seconds (Kubernetes: 10-30s, YARN: 30-60s)
Key components involved: Cluster Manager, Driver, Executors
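You can also control how long the Driver waits for executors to register before it starts scheduling tasks. A minimal sketch using Spark's standard scheduler settings; the ratio and timeout values are illustrative:
# Start scheduling once 80% of the requested executors have registered,
# or after 60 seconds, whichever comes first (values are illustrative).
spark = (
    SparkSession.builder
    .config("spark.scheduler.minRegisteredResourcesRatio", "0.8")
    .config("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
    .getOrCreate()
)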
Phase 3: Planning and Optimization
What happens: Your code is analyzed and converted into an execution plan.
Your Code:
df = spark.read.parquet("data.parquet")
result = df.filter(df.age > 25).groupBy("city").count()
result.write.parquet("output")
Step 1: Parse into Logical Plan
├─ Read parquet
├─ Filter age > 25
├─ GroupBy city
├─ Count
└─ Write parquet
Step 2: Optimize with Catalyst
├─ Reorder operations
├─ Push filters down (read only age > 25 records)
├─ Combine narrow transformations
└─ Choose best algorithms
Step 3: Create Physical Plan
├─ Read 100 partitions in parallel
├─ Filter in map phase
├─ Shuffle for groupBy
├─ Aggregate to get count
└─ Write results
Behind the scenes:
Timeline: 1-10 seconds (usually negligible)
Key components involved: Driver, Catalyst Optimizer
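You can inspect the plans Catalyst produces with DataFrame.explain(). A quick sketch reusing the query above (the mode argument requires Spark 3.0+):
df = spark.read.parquet("data.parquet")
result = df.filter(df.age > 25).groupBy("city").count()

# "extended" prints the parsed, analyzed, and optimized logical plans
# plus the physical plan; "formatted" prints a readable physical plan.
result.explain(mode="extended")
result.explain(mode="formatted")
Look for PushedFilters in the scan node of the physical plan to confirm that the age > 25 predicate was pushed down into the Parquet read.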
Phase 4: DAG Scheduling and Task Creation
What happens: Spark breaks the Physical Plan into stages, then stages into tasks.
Physical Plan
↓
DAG (Directed Acyclic Graph)
├─ Stage 0: Read + Filter (no shuffle needed)
│   ├─ Task 0 on Executor 1: Process partition 0
│   ├─ Task 1 on Executor 2: Process partition 1
│   ├─ Task 2 on Executor 3: Process partition 2
│   └─ … (100 tasks total, one per partition)
│
├─ [SHUFFLE BARRIER]
│   └─ Shuffle data across network
│
└─ Stage 1: Aggregate (post-shuffle)
    ├─ Task 100 on Executor 1: Aggregate group 0
    ├─ Task 101 on Executor 2: Aggregate group 1
    └─ Task 102 on Executor 3: Aggregate group 2
    (200 tasks total, one per output partition)
Behind the scenes:
Timeline: 1-5 seconds
Key components involved: DAGScheduler, Driver
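Task counts follow directly from partition counts, so you can predict them before running anything. A minimal sketch, assuming the same Parquet input as above:
df = spark.read.parquet("data.parquet")

# Tasks in the first stage = number of input partitions of the read.
print(df.rdd.getNumPartitions())

# Tasks in a post-shuffle stage default to spark.sql.shuffle.partitions
# (200 unless changed), which is why Stage 1 above has 200 tasks.
print(spark.conf.get("spark.sql.shuffle.partitions"))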
Phase 5: Task Execution
What happens: Tasks execute in parallel on executors.
Driver Task Queue:
[Task 0, Task 1, Task 2, …, Task 299]
Executor 1: Processing [Task 0, Task 1, Task 2]
Executor 2: Processing [Task 3, Task 4, Task 5]
Executor 3: Processing [Task 6, Task 7, Task 8]
…
Executor 10: Processing [Task 27, Task 28, Task 29]
(all executing in parallel)
Behind the scenes:
Timeline: Depends on data size and complexity (seconds to hours)
Key components involved: Driver, Executors, Storage System
Typical performance: 100MB/second per executor core (ranges from 10MB-1000MB/s)
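How many tasks actually run at once is capped by the cluster's task slots (executors x cores per executor). A minimal sketch for checking that number from the Driver:
sc = spark.sparkContext

# On a cluster this is roughly executors x cores, e.g. 10 x 4 = 40 slots;
# any tasks beyond that wait in the Driver's queue for a free slot.
print(sc.defaultParallelism)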
Phase 6: Shuffle (If Needed)
What happens: If your operation requires data movement (groupBy, join), a shuffle occurs between stages.
After Stage 0 Execution:
Executor 1: [city="NYC": rows 1-1000, city="LA": rows 1001-2000]
Executor 2: [city="NYC": rows 2001-3000, city="LA": rows 3001-4000]
Executor 3: [city="NYC": rows 4001-5000, city="SF": rows 5001-6000]
[SHUFFLE: Reorganize data]
↓
After Shuffle:
Executor 1: [city="NYC": all rows] (ready for aggregation)
Executor 2: [city="LA": all rows] (ready for aggregation)
Executor 3: [city="SF": all rows] (ready for aggregation)
Now Stage 1 can execute: the data is already grouped by city!
Behind the scenes:
Timeline: Can be very long for large operations (seconds to hours)
Performance: Shuffle is typically the bottleneck. Usually 10-50MB/second per node.
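Two settings have an outsized effect on shuffle behavior: the number of reduce-side partitions and adaptive query execution. A minimal sketch; the partition count is an illustrative value and AQE requires Spark 3.0+:
# Partitions (and therefore tasks) on the reduce side of every shuffle.
# The default of 200 is often wrong for very small or very large jobs.
spark.conf.set("spark.sql.shuffle.partitions", "400")  # illustrative value

# Let Spark coalesce small shuffle partitions at runtime (Spark 3.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")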
Phase 7: Result Collection and Application Termination
What happens: Results are collected, written to storage, and the application shuts down.
Executors Complete Final Tasks
↓
Results stored (either in output storage or Driver memory)
↓
Driver reports success
↓
Cluster Manager receives "application complete" signal
↓
Cluster Manager kills Executor JVMs
↓
Machines freed and available for other applications
↓
Driver process terminates
Behind the scenes:
Timeline: 1-30 seconds
Key components involved: Driver, Executors, Cluster Manager, Storage
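Where the results end up in this phase is the difference between a healthy Driver and an out-of-memory crash. A minimal sketch contrasting the two paths, reusing the result DataFrame from earlier; the output path is illustrative:
# Risky for large results: every row is pulled back into Driver memory.
rows = result.collect()

# Safer: executors write to distributed storage in parallel, and the
# Driver only coordinates. The bucket path below is an illustration.
result.write.mode("overwrite").parquet("s3a://my-bucket/output/")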
Complete Timeline Example: A Real Spark Job
Let's trace a realistic Spark job from submission to completion:
T+0:00   User runs: spark-submit --num-executors 50 my_job.py
T+0:05   Cluster Manager allocates 50 machines
         Executor JVMs starting on worker nodes
T+0:30   Executors registered with Driver
T+0:31   Driver parses code
         Catalyst optimizes plan
T+0:32   DAG created: 3 stages total
         - Stage 0: Read CSV + Filter (400 tasks)
         - Stage 1: Join with reference data (800 tasks)
         - Stage 2: Aggregate and write (200 tasks)
T+0:33   Stage 0 begins: Read + Filter
         50 executors x 8 cores = 400 parallel tasks
         Processing a 100GB CSV file at 50MB/sec per core
         Total throughput: 50 executors x 8 cores x 50MB/s = 20GB/s
         100GB ÷ 20GB/s = 5 seconds of reading
T+0:40   Stage 0 complete (400 tasks finished)
         Results ready for shuffle
T+0:42   Shuffle phase
         50 executors reorganizing data by join key
         ~50GB of data moving across the network
         Network bandwidth: 1GB/sec per executor
         Total: 50 executors x 1GB/s = 50GB/s
         50GB ÷ 50GB/s = 1 second (network bound)
         But shuffle files must also be written and read:
         File I/O can be slow, so the shuffle takes ~5 minutes total
T+5:47   Stage 1 begins: Join
         Data already grouped by key from the shuffle
         50 executors joining in parallel
         Shuffle output: 50GB
         Processing at 100MB/sec per executor
         50 executors x 100MB/sec = 5GB/s
         50GB ÷ 5GB/s = 10 seconds
T+5:58   Stage 1 complete
T+6:00   Stage 2 begins: Aggregate + Write
         Aggregating joined data
         Writing results to S3
         Output: 10GB
         Write speed: 50MB/sec per executor
         50 executors x 50MB/sec = 2.5GB/s
         10GB ÷ 2.5GB/s = 4 seconds
T+6:05   All stages complete
T+6:10   Executors killed by Cluster Manager
         Resources released
Total job time: 6 minutes 10 seconds
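The back-of-the-envelope math in this timeline is easy to reproduce. A small sketch that mirrors the assumed throughput numbers above (it ignores skew, scheduling overhead, and shuffle file I/O):
def stage_seconds(data_gb, executors, per_executor_mb_s):
    # Estimated wall-clock time, assuming throughput scales linearly
    # with executor count and every executor stays busy.
    throughput_gb_s = executors * per_executor_mb_s / 1000
    return data_gb / throughput_gb_s

print(stage_seconds(100, 50, 8 * 50))  # Stage 0 read: ~5s (8 cores x 50MB/s)
print(stage_seconds(50, 50, 100))      # Stage 1 join: ~10s
print(stage_seconds(10, 50, 50))       # Stage 2 write: ~4s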
Monitoring the Lifecycle: Spark UI
Spark provides a Web UI showing the lifecycle in real-time:
http://driver-hostname:4040
Tabs Available:
├─ Jobs: Shows each job (the action that triggered execution)
├─ Stages: Shows each stage and task performance
├─ Storage: Shows cached data in executors
├─ Environment: Shows configuration
└─ Executors: Shows executor health and metrics
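From a running application you can get the UI address directly from the SparkContext; a minimal sketch:
# URL of the live Spark UI for this application (port 4040 by default).
# The UI disappears when the Driver exits unless event logging
# (spark.eventLog.enabled) is on and a History Server is running.
print(spark.sparkContext.uiWebUrl)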
Common Bottlenecks by Phase
| Phase | Bottleneck | Symptom | Fix |
|---|---|---|---|
| Resource Negotiation | Cluster busy | Long wait before tasks start | Reduce executor count or wait |
| Planning | Complex query | Takes minutes to optimize | Simplify query structure |
| Task Execution | Slow I/O | Tasks take a long time | Increase parallelism or tune storage |
| Shuffle | Network bandwidth | Shuffle takes hours | Reduce shuffle volume or partition better |
| Result Collection | Driver memory | Driver out of memory | Avoid collect(), write to storage |
💡 Did You Know?
• Cold start: First Spark job on a cluster is 5-10x slower (JVM warm-up, JIT compilation)
• Shuffle often dominates job time: Optimizing shuffle is usually the most impactful tuning step
• Spark UI is the best debugging tool: 90% of issues visible in Spark UI
• Dynamic allocation: Executors can be added/removed during job execution
• Speculative execution: Spark can re-run slow tasks on other executors to speed up stragglers (see the config sketch after this list)
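The last two features are off by default and are enabled through configuration. A minimal sketch with illustrative bounds; dynamic allocation additionally needs shuffle tracking or an external shuffle service, depending on the cluster manager:
spark = (
    SparkSession.builder
    # Dynamic allocation: grow and shrink executors with the task backlog.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # illustrative
    .config("spark.dynamicAllocation.maxExecutors", "50")   # illustrative
    # Speculative execution: re-launch unusually slow tasks elsewhere.
    .config("spark.speculation", "true")
    .getOrCreate()
)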
📚 Study Notes
• 7 phases: Submission → Resource Negotiation → Planning → DAG → Execution → Shuffle → Termination
• Submission: spark-submit command launches application
• Resource negotiation: Cluster Manager allocates machines
• Planning: Catalyst optimizes logical plan to physical plan
• DAG creation: Physical plan broken into stages and tasks
• Execution: Tasks run in parallel on executors
• Shuffle: Data reorganized between stages (slow phase)
• Termination: Results written, executors killed, resources freed
• Typical job time distribution: 5% planning, 60% execution, 30% shuffle, 5% overhead
• Bottlenecks: Shuffle, network, I/O, or task execution
• Monitoring: Spark UI shows real-time progress for each phase