Spark Architecture

 

Apache Spark is designed to process big data by distributing tasks across multiple machines. Its architecture ensures high speed, scalability, and fault tolerance—making it a top choice for modern data systems.

 

Spark Follows a Master-Worker Architecture

Apache Spark processes large-scale data across distributed machines using a master-worker architecture.

The Driver coordinates the job, and Executors handle the actual execution in parallel across nodes.

 

At a high level, Spark consists of three key components:

  • Spark Driver (Master Node)
      • Acts as the main coordinator of the Spark application.
      • Converts your code into a DAG (Directed Acyclic Graph).
      • Breaks the DAG into stages and tasks, and schedules them on Executors.
  • Cluster Manager
      • Manages resources across the cluster.
      • Allocates Executors to run your tasks.
      • Spark supports multiple cluster managers: YARN (Hadoop), Kubernetes, Apache Mesos, and Standalone mode.
  • Executors (Worker Nodes)
      • Run the actual tasks in parallel.
      • Store intermediate results in memory for efficiency.
      • Continuously communicate with the Driver to report task status and completion.
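To make these roles concrete, here is a minimal PySpark sketch of starting an application. The master URL and the memory setting are illustrative assumptions for a local experiment, not required values.

    from pyspark.sql import SparkSession

    # Creating a SparkSession starts the Driver for this application.
    # "local[2]" runs Spark in a single JVM with 2 worker threads; on a real
    # cluster you would point .master() at YARN, Kubernetes, or a Standalone master.
    spark = (
        SparkSession.builder
        .appName("architecture-demo")           # name shown in the Spark UI
        .master("local[2]")                     # illustrative master URL
        .config("spark.executor.memory", "1g")  # memory the cluster manager grants each Executor
        .getOrCreate()
    )

    # The Driver holds the SparkContext, its handle for talking to Executors.
    print(spark.sparkContext.master)   # local[2]
    print(spark.sparkContext.appName)  # architecture-demo

    spark.stop()  # shuts down the Driver and releases the Executors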

Spark Architecture Overview

[Figure: Spark architecture overview (8_spark_architecture_image_1.png)]

  • Driver Program: Includes the Spark application, Driver, and SparkSession. Submits the job and controls execution.
  • Cluster Manager: Allocates resources and launches Executors on Worker Nodes.
  • Worker Nodes (Executors): Run tasks and process data partitions in parallel.
  • Task Flow: Data and transformations are sent to Executors; results flow back to the Driver after execution.

     

Step-by-Step: Spark Execution Flow

Let’s walk through what happens behind the scenes when you run a Spark job:

  • Spark Application Starts
      • The user submits a job via Python, Scala, Java, or SQL.
      • The Driver initializes and begins the execution process.
  • Driver Translates Code into a DAG
      • The code is broken into a Directed Acyclic Graph of operations.
      • The DAG is split into stages, and each stage contains tasks.
  • Cluster Manager Allocates Executors
      • The Cluster Manager provisions Executors on available machines.
      • Executors are launched on the worker nodes.
  • Executors Run Tasks in Parallel
      • Tasks are distributed across Executors.
      • Results are kept in memory for speed or written to disk when needed.
  • Results Returned to Driver
      • Completed task results are sent back to the Driver.
      • The Driver aggregates and presents the final result to the user.
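The sketch below makes this flow visible in PySpark. The dataset and column expressions are illustrative; explain() prints the physical plan the Driver derives from the DAG, and the count() action is what actually turns that plan into stages and tasks on the Executors.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("execution-flow-demo").master("local[2]").getOrCreate()

    # Transformations only describe the computation; nothing runs yet (lazy evaluation).
    df = spark.range(1_000_000)                        # illustrative dataset: ids 0..999999
    evens = df.filter(F.col("id") % 2 == 0)            # transformation
    doubled = evens.withColumn("x2", F.col("id") * 2)  # transformation

    # The Driver can show the physical plan it built from the code above.
    doubled.explain()

    # An action triggers execution: the plan becomes stages and tasks on the
    # Executors, and the aggregated result comes back to the Driver.
    print(doubled.count())  # 500000

    spark.stop()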

Quick Breakdown of the Flow

  • The user submits a Spark job (Python, Scala, Java, or SQL).
  • The Driver builds a DAG and requests resources from the Cluster Manager.
  • The Cluster Manager assigns Executors on available Worker Nodes.
  • Executors process tasks in parallel and store results in memory or on disk.
  • Final results are returned to the Driver and shown to the user.

💡 Did You Know?

  • Executors can be short-lived or long-running depending on your workload: a batch job versus a long-lived streaming application.
  • You can monitor Spark jobs using the Spark UI, which shows DAGs, stages, tasks, memory usage, and more.
  • Spark’s fault tolerance comes from re-executing failed tasks based on lineage information in RDDs (see the sketch below).
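Both facts are easy to check from a PySpark shell; the tiny RDD here is an illustrative assumption. toDebugString() prints the lineage Spark would replay to rebuild a lost partition, and uiWebUrl points at the Spark UI of the running application.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-demo").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    # A small RDD with a couple of transformations (illustrative data).
    rdd = sc.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x > 10)

    # The lineage: the chain of transformations Spark re-executes on failure.
    print(rdd.toDebugString().decode("utf-8"))

    # Where to find the Spark UI for this application (DAGs, stages, tasks, memory).
    print(sc.uiWebUrl)  # e.g. http://<driver-host>:4040

    spark.stop()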

📚 Study Notes

Master-Worker Architecture

  • Driver = Master → Coordinates the Spark job, maintains metadata, and converts code into a DAG (Directed Acyclic Graph) of stages and tasks.
  • Cluster Manager = Resource Dispatcher → Allocates resources (CPU, memory) across the cluster. Can be Standalone, YARN, Mesos, or Kubernetes.
  • Executors = Workers → Run on worker nodes; they execute tasks and store data in memory or on disk.
  • Spark translates user code → DAG → optimized physical plan → stages → tasks → parallel execution.
  • Spark supports multiple cluster managers, making it highly portable across environments; the sketch below shows the same app pointed at different masters.
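A sketch of that portability: the application logic is unchanged and only the master URL differs per environment. Every host name and port below is a placeholder, not a real endpoint.

    from pyspark.sql import SparkSession

    # Placeholder master URLs for the cluster managers Spark supports.
    masters = {
        "standalone": "spark://master-host:7077",    # Spark Standalone cluster
        "yarn":       "yarn",                        # Hadoop YARN (needs HADOOP_CONF_DIR set)
        "mesos":      "mesos://mesos-master:5050",   # Apache Mesos
        "kubernetes": "k8s://https://k8s-api:6443",  # Kubernetes API server
        "local":      "local[*]",                    # single machine, all cores
    }

    spark = (
        SparkSession.builder
        .appName("portable-app")
        .master(masters["local"])  # swap the key to target another cluster manager
        .getOrCreate()
    )
    print(spark.sparkContext.master)
    spark.stop()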

     

How Spark Executes a Job

  • The user submits a job (e.g., PySpark code).
  • The Driver converts the job into a DAG (stages and tasks).
  • The Cluster Manager allocates resources (Executors on Worker Nodes).
  • Executors process tasks in parallel, each working on partitions of the data.
  • Executors send results back to the Driver as the final output; a small partition sketch follows.
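To see the "partitions of data" step concretely, the snippet below builds an illustrative 4-partition dataset and uses glom() to reveal exactly which rows each parallel task would process.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").master("local[4]").getOrCreate()
    sc = spark.sparkContext

    # Illustrative dataset split into 4 partitions; each partition becomes one task.
    rdd = sc.parallelize(range(12), numSlices=4)

    print(rdd.getNumPartitions())  # 4

    # glom() turns each partition into a list, so collect() shows the
    # per-partition chunks that individual tasks process in parallel.
    print(rdd.glom().collect())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

    spark.stop()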

     

     

     
