DAG – Directed Acyclic Graph
The DAG (Directed Acyclic Graph) is Spark's internal representation of your computation: a graph of all transformations and their dependencies, where nodes represent RDD/DataFrame operations and edges represent data flow. Understanding DAGs is essential for optimizing performance, debugging slow jobs, and grasping how Spark translates your code into parallel execution. The DAG is the bridge between your code and the actual work done on the cluster.
What is a DAG?
Definition
A DAG (Directed Acyclic Graph) is a mathematical graph structure where:
Directed: every edge points one way, from a parent operation to the operation that depends on it
Acyclic: there are no cycles, so data never flows back to an earlier node
Graph: nodes (operations) connected by edges (data dependencies)
Your code:
df1 = spark.read.parquet("data.parquet")
df2 = df1.filter(df1.age > 25)
df3 = df2.groupBy("city").count()
df3.show()
DAG Representation :
┌─────────────┐
│ PARQUET │
│ ( Source ) │
└──────┬──────┘
│
↓
┌─────────────────┐
│ FILTER │
│ ( age > 25 ) │
└──────┬──────────┘
│
↓
┌──────────────────┐
│ GROUPBY + COUNT │
│ ( by city ) │
└──────┬───────────┘
│
↓
┌──────────────────┐
│ SHOW ( Action ) │
└──────────────────┘
RDD Lineage vs DAG
RDD Lineage
Each RDD (Resilient Distributed Dataset) remembers how it was created:
rdd1 = sc.textFile("data.txt")                 # RDD from file
rdd2 = rdd1.map(lambda x: x.upper())           # RDD from transformation
rdd3 = rdd2.filter(lambda x: len(x) > 10)      # RDD from filter
rdd3.collect()                                 # Triggers execution based on lineage
Lineage chain:
textFile(data.txt) → map(upper) → filter(len > 10) → collect()
If an RDD is lost, Spark recomputes it from the original source using lineage.
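You can inspect this lineage directly. A minimal sketch (reusing the hypothetical data.txt input above) that prints the dependency chain with the RDD API's toDebugString:
rdd1 = sc.textFile("data.txt")
rdd2 = rdd1.map(lambda x: x.upper())
rdd3 = rdd2.filter(lambda x: len(x) > 10)

# toDebugString() returns the recursive dependency chain (bytes in recent PySpark versions)
print(rdd3.toDebugString().decode("utf-8"))
# The output walks the chain back to the textFile source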
DAG vs RDD Lineage
RDD Lineage: Tracks individual RDDs and how each was created
DAG: Complete graph showing all operations and dependencies for the entire job
The DAG includes:
Every operation (read, transformations, action) as a node
The dependencies between operations as edges
The stage boundaries introduced by shuffles
DAG Creation Process
When you call an action:
1. PARSE CODE
└─ Convert Python/Scala/SQL to internal operations
2. CREATE LOGICAL PLAN
└─ List of operations in order
└─ Read → Filter → GroupBy → Count
3. CREATE RDD LINEAGE
└─ Each operation creates an RDD
└─ Track parent-child relationships
4. OPTIMIZE (Catalyst)
└─ Reorder operations
└─ Combine compatible operations
└─ Push down filters
5. CREATE DAG
└─ Visualize entire computation graph
└─ Show all dependencies
└─ Identify stage boundaries
6. BREAK INTO STAGES
└─ Group operations before shuffle
└─ Separate stages after shuffle
7. CREATE TASKS
└─ One task per partition per stage
└─ Assign to executors
8. EXECUTE
└─ Run tasks in dependency order
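You can see the first several of these steps reflected in the query plans. A minimal sketch, assuming a hypothetical data.parquet file with age and city columns, using DataFrame.explain:
df = spark.read.parquet("data.parquet")
result = df.filter(df.age > 25).groupBy("city").count()

# extended=True prints the parsed, analyzed, and optimized logical plans,
# plus the physical plan that gets broken into stages and tasks
result.explain(extended=True)

# Spark 3.x also supports a more readable layout:
# result.explain(mode="formatted")
Note that explain() only shows the plan; the actual stage and task breakdown appears in the Spark UI once an action runs.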
DAG Example: Real Application
Your code:
df = spark.read.parquet("s3://bucket/events.parquet")

# Branch 1
branch1 = df.filter(df.type == "click")
count1 = branch1.count()
print(f"Clicks: {count1}")

# Branch 2
branch2 = df.filter(df.type == "impression")
count2 = branch2.count()
print(f"Impressions: {count2}")

# Both branches derived from the same source
DAG representation:
            ┌─────────────────┐
            │     PARQUET     │
            │ (events.parquet)│
            └────────┬────────┘
                     │
             ┌───────┴───────┐
             ↓               ↓
      ┌─────────────┐ ┌─────────────┐
      │   FILTER    │ │   FILTER    │
      │ type=click  │ │  type=imp   │
      └──────┬──────┘ └──────┬──────┘
             │               │
             ↓               ↓
      ┌─────────────┐ ┌─────────────┐
      │    COUNT    │ │    COUNT    │
      └──────┬──────┘ └──────┬──────┘
             │               │
             └───────┬───────┘
                     ↓
         ┌──────────────────────────┐
         │  Results collected to    │
         │  Driver and printed      │
         └──────────────────────────┘
Key insight: Both branches share the same source node in the DAG. However, count1 and count2 are separate actions, so each triggers its own job, and Spark re-reads the Parquet source for each one unless df is cached (see Pattern 1 under Common DAG Anti-Patterns below).
Narrow vs Wide Transformations in DAGs
Narrow Transformations (Same Stage)
Operations where each output partition depends on only one input partition:
df.filter()      # Each output partition depends on one input partition
rdd.map()        # Each output depends on one input (RDD API; use select/withColumn on DataFrames)
df.select()      # Each output depends on one input
rdd.flatMap()    # Each output depends on one input (RDD API)
In DAG: All in same stage, no shuffle needed.
Input Partition 1 → Filter → Output Partition 1
Input Partition 2 → Filter → Output Partition 2
Input Partition 3 → Filter → Output Partition 3
(no data movement between partitions)
Wide Transformations (Multiple Stages)
Operations where output partitions depend on multiple input partitions:
df.groupBy()        # Must shuffle to bring matching keys together
df.join()           # Must shuffle to co-locate matching keys
df.distinct()       # Must shuffle to find duplicates
df.repartition()    # Shuffles data to the new partition count
rdd.sortBy()        # Shuffles to sort across partitions (orderBy/sort in the DataFrame API)
In DAG: Creates stage boundary, shuffle required.
[STAGE 1: Narrow transforms + Shuffle Write]
Input Partitions → Shuffle Write to files

[Network: Move data]

[STAGE 2: Shuffle Read + Operation]
Shuffle files → Read + GroupBy/Join/etc.
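One way to see the difference in practice, a sketch assuming the same hypothetical data.parquet with age and city columns: narrow transformations add no Exchange node to the physical plan, while a wide transformation such as groupBy does, and that Exchange marks the stage boundary.
df = spark.read.parquet("data.parquet")

# Narrow only: the physical plan contains no Exchange (no shuffle, single stage)
df.filter(df.age > 25).select("city", "age").explain()

# Wide: groupBy adds an "Exchange hashpartitioning(city, ...)" node, i.e. a shuffle
df.filter(df.age > 25).groupBy("city").count().explain()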
Stage Boundaries in DAG
df = spark.read.parquet("data.parquet")
df = df.filter(df.age > 25)             # Narrow
df = df.groupBy("city").count()         # Wide (shuffle!)
df = df.filter(df["count"] > 100)       # Narrow
df.write.parquet("output")
DAG with stages:
STAGE 0 (Read + narrow transforms)
├─ Read parquet
├─ Filter age > 25
└─ [Output: 100 partitions, written as shuffle files]

        [SHUFFLE BOUNDARY]

STAGE 1 (Shuffle read + aggregation + narrow transforms)
├─ GroupBy city (shuffle read)
├─ Count
├─ Filter count > 100 (narrow, so no new stage)
├─ Write parquet to output
└─ [Output: 200 partitions]
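The 100 partitions in Stage 0 come from the input file's splits, while the 200 partitions after the shuffle come from spark.sql.shuffle.partitions, which defaults to 200. A sketch of changing it, with 400 as an arbitrary example value:
# Controls how many partitions (and therefore tasks) shuffle stages produce
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.read.parquet("data.parquet")
df.filter(df.age > 25).groupBy("city").count().write.parquet("output")
# The aggregation stage now targets 400 shuffle partitions instead of the default 200.
# With adaptive query execution (on by default since Spark 3.2), Spark may coalesce
# small shuffle partitions at runtime, so the final task count can be lower.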
Viewing DAGs: Spark UI
The Spark Web UI visualizes DAGs:
http://<driver-hostname>:4040
Click on a Job → See DAG visualization
The UI shows:
The DAG visualization for each job
The stages the job was split into and their shuffle boundaries
Per-stage task counts, durations, and shuffle read/write sizes
Example DAG visualization in the UI: the job detail page renders each operation (e.g., Read Parquet) as a box, connected by arrows and grouped into stage boxes.
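When many jobs appear on the Jobs page, naming them makes the right DAG easier to find. A small sketch using SparkContext.setJobDescription (the description text is arbitrary):
spark.sparkContext.setJobDescription("clicks per city - daily report")

df = spark.read.parquet("s3://bucket/events.parquet")
df.filter(df.type == "click").groupBy("city").count().show()
# The triggered job appears under this description on the Jobs page;
# expanding "DAG Visualization" on the job's detail page shows its DAG and stages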
DAG Optimization by Catalyst
Catalyst optimizer rewrites DAGs:
Original DAG (your code):
Read → Filter A → Select → Filter B → GroupBy
Optimized DAG (what Catalyst creates):
Read → [Combined Filter A + B] → Select → GroupBy
(one pass instead of two)
Or even better:
Read (with Filters A + B pushed down to the scan) → Select → GroupBy
(only filtered data is read from the source)
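You can confirm the push-down case in the physical plan. A sketch assuming a Parquet source with an age column; when the filter reaches the reader, the scan node lists it under PushedFilters:
df = spark.read.parquet("data.parquet")
df.filter(df.age > 25).explain()

# The scan node of the physical plan should include something like:
#   PushedFilters: [IsNotNull(age), GreaterThan(age,25)]
# meaning the Parquet reader skips data that cannot match,
# instead of filtering it after the read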
DAG Execution: How Stages Execute
Once the DAG is created and split into stages:

DAG with 3 stages:

STAGE 0 (Read)       STAGE 1 (GroupBy)     STAGE 2 (Repartition + Write)
├─ Task 0            ├─ Task 4             ├─ Task 7
├─ Task 1            ├─ Task 5             ├─ Task 8
├─ Task 2            └─ Task 6             └─ Task 9
└─ Task 3
Execution order:
1. Execute all Stage 0 tasks in parallel (read + filter)
2. WAIT for all Stage 0 tasks to complete; shuffle data for Stage 1
3. Execute all Stage 1 tasks in parallel (groupBy + count)
4. WAIT for all Stage 1 tasks to complete; shuffle data for Stage 2 (the repartition)
5. Execute all Stage 2 tasks in parallel (write)
Key: All tasks in a stage must complete before next stage begins.
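Since each stage runs one task per partition, you can estimate task counts from partition counts. A small sketch (the actual numbers depend on the input files and your configuration):
df = spark.read.parquet("data.parquet")

# Tasks in the read stage ≈ number of input partitions (driven by file splits)
print(df.rdd.getNumPartitions())

# Tasks in a post-shuffle stage are governed by spark.sql.shuffle.partitions
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" by default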
Understanding DAG Complexity
Simple DAG (Linear)
Read → Filter → Select → GroupBy → Write
└─ One path, no branching
└─ Quick to optimize, fast execution
Complex DAG (Multiple Paths)
         ┌─────────────────────┐
         │      Read File      │
         └──────────┬──────────┘
                    │
          ┌─────────┴──────────┐
          ↓                    ↓
     ┌──────────┐        ┌──────────┐
     │  Filter  │        │  Select  │
     │  A > 25  │        │   Cols   │
     └────┬─────┘        └────┬─────┘
          │                   │
          ├─ Execution Path 1 │
          │  GroupBy → Join   │
          │  (for report A)   │
          │                   │
          └─ Execution Path 2 │
             Union ←──── Join ←┘
             (for report B)
More complex DAGs take longer to optimize but offer more optimization opportunities.
Common DAG Anti-Patterns
Pattern 1: Multiple Expensive Operations on the Same RDD
Bad:
df = spark.read.parquet("100GB.parquet")        # Expensive read
result1 = df.groupBy("col1").count().collect()  # Job 1: read + groupBy
result2 = df.filter(df.col > 10).count()        # Job 2: read + filter AGAIN!
result3 = df.join(other_df).count()             # Job 3: read + join AGAIN!
Each action re-reads the file!
Fix: Cache after expensive operations
df = spark.read.parquet("100GB.parquet").cache()
result1 = df.groupBy("col1").count().collect()
result2 = df.filter(df.col > 10).count()
result3 = df.join(other_df).count()
File read once, cached data reused!
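Cached data occupies executor storage, so it is good hygiene to release it once the reuse is over. A minimal follow-up sketch:
# Once all three results have been computed, drop the cached blocks
df.unpersist()
If the default cache() storage level is not a good fit, DataFrame.persist() accepts an explicit StorageLevel instead.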
Pattern 2: Unnecessary Shuffles
Bad:
(df.groupBy("col1").agg(...)      # Shuffle 1
   .repartition(100)              # Shuffle 2 (unnecessary!)
   .write.parquet(...))
Fix: Repartition by the grouping key once, before the aggregation
(df.repartition(100, "col1")      # Shuffle once; the aggregation reuses this partitioning
   .groupBy("col1").agg(...)
   .write.parquet(...))
📚 Study Notes
• DAG: Directed Acyclic Graph showing operations and dependencies
• Nodes: RDDs, DataFrames, or operations
• Edges: Data dependencies and flow
• RDD Lineage: How each RDD was created (parent → child)
• Narrow transformations: One input partition → one output partition (same stage)
• Wide transformations: Each output partition depends on multiple input partitions (shuffle boundary)
• Stage boundaries: Created by shuffle operations (groupBy, join, distinct, etc.)
• Catalyst optimization: Rewrites DAG to minimize data movement
• Push-down filtering: Filters moved to read phase
• Spark UI: Visualizes DAG for debugging and analysis
• Task execution: Tasks in the same stage run in parallel; stages run sequentially
• Caching: Prevents redundant DAG re-execution
• Complexity: More complex DAGs have more optimization opportunities