Spark Architecture

 

Apache Spark is designed to process big data by distributing tasks across multiple machines. Its architecture ensures high speed, scalability, and fault tolerance—making it a top choice for modern data systems.

 

Spark Follows a Master-Worker Architecture

Apache Spark processes large-scale data across distributed machines using a master-worker architecture.

The Driver coordinates the job, and Executors handle the actual execution in parallel across nodes.

 

At a high level, Spark consists of three key components:

  • Spark Driver (Master Node)
      • Acts as the main coordinator of the Spark application.
      • Converts your code into a DAG (Directed Acyclic Graph).
      • Breaks the DAG into stages and tasks, and schedules them on Executors.
  • Cluster Manager
      • Manages resources across the cluster.
      • Allocates Executors to run your tasks.
      • Spark supports multiple cluster managers: YARN (Hadoop), Kubernetes, Apache Mesos, and Standalone mode.
  • Executors (Worker Nodes)
      • Run the actual tasks in parallel.
      • Store intermediate results in memory for efficiency.
      • Continuously communicate with the Driver to report task status and completion.
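To make these roles concrete, here is a minimal PySpark sketch of starting an application. The master URL and the memory setting are illustrative assumptions for a local experiment, not required values.

    from pyspark.sql import SparkSession

    # Creating a SparkSession starts the Driver for this application.
    # "local[2]" runs Spark in a single JVM with 2 worker threads; on a real
    # cluster you would point .master() at YARN, Kubernetes, or a Standalone master.
    spark = (
        SparkSession.builder
        .appName("architecture-demo")           # name shown in the Spark UI
        .master("local[2]")                     # illustrative master URL
        .config("spark.executor.memory", "1g")  # memory the cluster manager grants each Executor
        .getOrCreate()
    )

    # The Driver holds the SparkContext, its handle for talking to Executors.
    print(spark.sparkContext.master)   # local[2]
    print(spark.sparkContext.appName)  # architecture-demo

    spark.stop()  # shuts down the Driver and releases the Executors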

Spark Architecture Overview

[Figure: Spark architecture overview (8_spark_architecture_image_1.png)]

  • Driver Program: Includes the Spark application, Driver, and SparkSession. Submits the job and controls execution.
  • Cluster Manager: Allocates resources and launches Executors on Worker Nodes.
  • Worker Nodes (Executors): Run tasks and process data partitions in parallel.
  • Task Flow: Data and transformations are sent to Executors; results flow back to the Driver after execution.

     

Step-by-Step: Spark Execution Flow

Let’s walk through what happens behind the scenes when you run a Spark job:

  • Spark Application Starts
      • The user submits a job via Python, Scala, Java, or SQL.
      • The Driver initializes and begins the execution process.
  • Driver Translates Code into a DAG
      • The code is broken into a Directed Acyclic Graph of operations.
      • The DAG is split into stages, and each stage contains tasks.
  • Cluster Manager Allocates Executors
      • The Cluster Manager provisions Executors on available machines.
      • Executors are launched on the worker nodes.
  • Executors Run Tasks in Parallel
      • Tasks are distributed across Executors.
      • Results are kept in memory for speed or written to disk when needed.
  • Results Returned to Driver
      • Completed task results are sent back to the Driver.
      • The Driver aggregates and presents the final result to the user.
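The sketch below makes this flow visible in PySpark. The dataset and column expressions are illustrative; explain() prints the physical plan the Driver derives from the DAG, and the count() action is what actually turns that plan into stages and tasks on the Executors.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("execution-flow-demo").master("local[2]").getOrCreate()

    # Transformations only describe the computation; nothing runs yet (lazy evaluation).
    df = spark.range(1_000_000)                        # illustrative dataset: ids 0..999999
    evens = df.filter(F.col("id") % 2 == 0)            # transformation
    doubled = evens.withColumn("x2", F.col("id") * 2)  # transformation

    # The Driver can show the physical plan it built from the code above.
    doubled.explain()

    # An action triggers execution: the plan becomes stages and tasks on the
    # Executors, and the aggregated result comes back to the Driver.
    print(doubled.count())  # 500000

    spark.stop()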

Quick Breakdown of the Flow

  • The user submits a Spark job (Python, Scala, Java, or SQL).
  • The Driver builds a DAG and requests resources from the Cluster Manager.
  • The Cluster Manager assigns Executors on available Worker Nodes.
  • Executors process tasks in parallel and store results in memory or on disk.
  • Final results are returned to the Driver and shown to the user.

💡 Did You Know?

  • Executors can be short-lived or long-running depending on your workload: a batch job versus a long-lived streaming application.
  • You can monitor Spark jobs using the Spark UI, which shows DAGs, stages, tasks, memory usage, and more.
  • Spark’s fault tolerance comes from re-executing failed tasks based on lineage information in RDDs (see the sketch below).
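Both facts are easy to check from a PySpark shell; the tiny RDD here is an illustrative assumption. toDebugString() prints the lineage Spark would replay to rebuild a lost partition, and uiWebUrl points at the Spark UI of the running application.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-demo").master("local[2]").getOrCreate()
    sc = spark.sparkContext

    # A small RDD with a couple of transformations (illustrative data).
    rdd = sc.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x > 10)

    # The lineage: the chain of transformations Spark re-executes on failure.
    print(rdd.toDebugString().decode("utf-8"))

    # Where to find the Spark UI for this application (DAGs, stages, tasks, memory).
    print(sc.uiWebUrl)  # e.g. http://<driver-host>:4040

    spark.stop()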

📚 Study Notes

Master-Worker Architecture

  • Driver = Master → Coordinates the Spark job, maintains metadata, and converts code into a DAG (Directed Acyclic Graph) of stages and tasks.
  • Cluster Manager = Resource Dispatcher → Allocates resources (CPU, memory) across the cluster. Can be Standalone, YARN, Mesos, or Kubernetes.
  • Executors = Workers → Run on worker nodes; they execute tasks and store data in memory or on disk.
  • Spark translates user code → DAG → optimized physical plan → stages → tasks → parallel execution.
  • Spark supports multiple cluster managers, making it highly portable across environments; the sketch below shows the same app pointed at different masters.
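A sketch of that portability: the application logic is unchanged and only the master URL differs per environment. Every host name and port below is a placeholder, not a real endpoint.

    from pyspark.sql import SparkSession

    # Placeholder master URLs for the cluster managers Spark supports.
    masters = {
        "standalone": "spark://master-host:7077",    # Spark Standalone cluster
        "yarn":       "yarn",                        # Hadoop YARN (needs HADOOP_CONF_DIR set)
        "mesos":      "mesos://mesos-master:5050",   # Apache Mesos
        "kubernetes": "k8s://https://k8s-api:6443",  # Kubernetes API server
        "local":      "local[*]",                    # single machine, all cores
    }

    spark = (
        SparkSession.builder
        .appName("portable-app")
        .master(masters["local"])  # swap the key to target another cluster manager
        .getOrCreate()
    )
    print(spark.sparkContext.master)
    spark.stop()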

     

How Spark Executes a Job

  • The user submits a job (e.g., PySpark code).
  • The Driver converts the job into a DAG (stages and tasks).
  • The Cluster Manager allocates resources (Executors on Worker Nodes).
  • Executors process tasks in parallel, each working on partitions of the data.
  • Executors send results back to the Driver as the final output; a small partition sketch follows.
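To see the "partitions of data" step concretely, the snippet below builds an illustrative 4-partition dataset and uses glom() to reveal exactly which rows each parallel task would process.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").master("local[4]").getOrCreate()
    sc = spark.sparkContext

    # Illustrative dataset split into 4 partitions; each partition becomes one task.
    rdd = sc.parallelize(range(12), numSlices=4)

    print(rdd.getNumPartitions())  # 4

    # glom() turns each partition into a list, so collect() shows the
    # per-partition chunks that individual tasks process in parallel.
    print(rdd.glom().collect())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

    spark.stop()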

     

     

     
