Example: Spark Job Execution

 

Understanding Spark becomes much easier when you see it in action. Let’s walk through a simple PySpark example and track how it becomes a Spark Job that is split into Stages and Tasks.

 

Example: Code That Triggers a Spark Job
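
Below is a minimal PySpark sketch consistent with the operations referenced throughout this section: a CSV read, a filter on age > 30, a groupBy("city").count(), and show(). The file name people.csv, the explicit schema, and the SparkSession setup are illustrative assumptions rather than the original listing.

```python
from pyspark.sql import SparkSession

# Assumed setup: any existing SparkSession works; the app name is arbitrary.
spark = SparkSession.builder.appName("JobExecutionExample").getOrCreate()

# 1. Read a CSV file. With an explicit schema the read stays fully lazy
#    (asking Spark to infer the schema would trigger a small job of its own).
#    "people.csv" is a hypothetical file with name, age, and city columns.
df = (
    spark.read
    .option("header", True)
    .schema("name STRING, age INT, city STRING")
    .csv("people.csv")
)

# 2. Narrow transformation: only recorded in the DAG, nothing executes yet.
adults = df.filter(df["age"] > 30)

# 3. Wide transformation: still lazy, but it will require a shuffle when executed.
city_counts = adults.groupBy("city").count()

# 4. Action: here Spark creates a Job, splits it into Stages at the
#    shuffle boundary, and runs Tasks on the Executors.
city_counts.show()
```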

Breakdown of Execution Flow

  • Reading CSV (df = spark.read.csv(…)): no job is triggered yet (lazy evaluation).
  • Transformations (filter(), groupBy()): still no job is triggered; Spark only records them in a DAG (a quick way to confirm this is sketched after this list).
  • Action (show()): Spark creates a Job and executes it.
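
To confirm the lazy behaviour described above, you can ask Spark to print the plan it has recorded without running anything. A small sketch, reusing the city_counts DataFrame from the example above:

```python
# Nothing has executed yet: filter() and groupBy().count() only extended the plan.
# explain() prints the physical plan Spark would run; it does not trigger a job.
city_counts.explain()

# The printed plan contains an Exchange (shuffle) operator for the groupBy,
# which is exactly where Spark will split the job into two stages.
```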

    Step               | What Happens                           | Execution Type
    -------------------|----------------------------------------|--------------------------
    read.csv()         | Spark loads the schema lazily          | No execution yet
    filter()           | Transformation is recorded in the DAG  | Narrow transformation
    groupBy().count()  | Adds a wide transformation             | Shuffle required
    show()             | Triggers a Job                         | Action: execution begins

    [Image: Spark job execution flow (11_example_spark_job_execution_image_1.png)]

     

    Component | Definition                                                                                                     | Example
    ----------|----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------
    Job       | A high-level execution triggered when an action (like show(), collect(), count()) is called.                  | df.show() creates a Job
    Stage     | A set of tasks that can be executed in parallel; a Job is divided into multiple Stages based on its transformations. | filter() (Stage 1) → groupBy() (Stage 2)
    Task      | The smallest unit of execution; each task processes one data partition.                                       | Task 1 processes Partition 1, Task 2 processes Partition 2
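
The Job → Stage → Task hierarchy in the table above can also be observed programmatically. The sketch below uses PySpark's status tracker; the job-group name "demo" and the reuse of df from the earlier example are assumptions for illustration.

```python
# Run one action inside a named job group so its jobs can be looked up afterwards.
sc = spark.sparkContext
sc.setJobGroup("demo", "groupBy + count example")
df.filter(df["age"] > 30).groupBy("city").count().show()

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("demo"):
    job = tracker.getJobInfo(job_id)
    if job is None:           # info for old jobs may already be gone
        continue
    print(f"Job {job_id}: stages {job.stageIds}")
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)
        if stage is not None:
            print(f"  Stage {stage_id}: {stage.numTasks} tasks")
```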

     

    How Spark Sees This Code

  • The Driver builds a DAG: filter() → groupBy()
  • Stage 1 (Narrow): executes filter(df["age"] > 30)
  • Stage 2 (Wide): executes groupBy("city"), causing a shuffle
  • Tasks: created for each partition and sent to Executors (a quick partition check is sketched after this list)
  • Results: collected by the Driver and displayed via .show()
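
Because one Task is created per partition, the number of Tasks in Stage 1 is simply the partition count of the DataFrame, while the task count of the post-shuffle stage follows the shuffle-partition setting. A quick check, reusing df from the example above:

```python
# Tasks in Stage 1 = number of partitions backing the DataFrame.
print(df.rdd.getNumPartitions())

# Tasks in the post-shuffle stage (Stage 2) are governed by this setting (default: 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```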

    Execution Step            | Type                                    | Effect
    --------------------------|-----------------------------------------|----------------------------------------
    filter(df["age"] > 30)    | Narrow transformation                   | Stays within Stage 1
    groupBy("city").count()   | Wide transformation (shuffle required)  | A new Stage 2 is created
    show()                    | Action (triggers a Job)                 | Spark creates a Job
    Tasks                     | Executed on Worker Nodes                | Each task processes a data partition

    💡 Did You Know?

    •   The number of Tasks created equals the number of data partitions. You can control partitioning with .repartition() or .coalesce() (see the sketch below).

    •   Wide transformations (like groupBy()) require shuffles; these are expensive and should be minimized when possible.
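
As a rough illustration of the first point, the snippet below changes the partition count, and with it the number of tasks a stage would get; the target values 8 and 2 are arbitrary.

```python
# repartition() can increase or decrease the partition count, at the cost of a full shuffle.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())   # 8

# coalesce() merges existing partitions without a shuffle, so it can only reduce the count.
df2 = df8.coalesce(2)
print(df2.rdd.getNumPartitions())   # 2
```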

     

    📚 Study Notes

    •   Driver → DAG → Stages → Tasks → Executors

    •   Transformations (like filter(), groupBy()) are lazy and only recorded.

    •   Actions (like show(), collect()) trigger the actual job execution.

    •   Spark splits the job into Stages at shuffle boundaries.

    •   Executors run Tasks in parallel on partitions across Worker Nodes.

    •   You can track this entire process visually in the Spark Web UI (see the snippet below for how to find its URL).
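
For the last point, the Web UI of a locally running application is usually served at http://localhost:4040; the one-liner below simply asks the running SparkContext where its UI lives.

```python
# URL of the Spark Web UI for this application (None if the UI is disabled).
print(spark.sparkContext.uiWebUrl)
```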

     

     

     
