History of Apache Spark

 

Before Spark, there was MapReduce—a powerful but clunky model pioneered by Google and later implemented in Apache Hadoop. It worked, but had a fatal flaw: it was slow.

Why? Because every step required disk I/O: each MapReduce job wrote its intermediate results out to disk, and the next job had to read them back in before it could start.

Imagine cooking a meal, but having to go back to the fridge for every single ingredient.
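The fridge analogy can be made concrete with a toy, plain-Python sketch (not real MapReduce or Spark code — the function names here are purely illustrative): the "MapReduce-style" pipeline persists every intermediate result to disk between stages, while the "Spark-style" pipeline chains the same stages in memory.

```python
import json
import tempfile
from pathlib import Path

def mapreduce_style(data, stages, workdir):
    """Run each stage, writing intermediates to disk between steps
    (a caricature of how MapReduce jobs hand data to each other)."""
    current = data
    for i, stage in enumerate(stages):
        result = [stage(x) for x in current]
        path = Path(workdir) / f"stage_{i}.json"
        path.write_text(json.dumps(result))      # disk write after every stage
        current = json.loads(path.read_text())   # disk read before the next one
    return current

def in_memory_style(data, stages):
    """Chain the same stages, keeping intermediates in memory
    (the idea behind Spark's in-memory computing)."""
    current = data
    for stage in stages:
        current = [stage(x) for x in current]
    return current

stages = [lambda x: x + 1, lambda x: x * 2]
with tempfile.TemporaryDirectory() as d:
    assert mapreduce_style([1, 2, 3], stages, d) == in_memory_style([1, 2, 3], stages)
```

Both pipelines compute identical results; the only difference is where the intermediate data lives. Eliminating those round trips to storage between stages is the core of Spark's speedup over MapReduce.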

 

In 2009, a group of researchers at UC Berkeley’s AMPLab had enough. They created Apache Spark to fix this.

It introduced:

  • In-memory computing
  • Simpler APIs
  • Iterative processing support

    By 2013, Spark had joined the Apache Software Foundation. That same year, its creators founded Databricks, which still leads Spark's development today.

     

    Spark quickly became the de facto big data processing engine, unifying diverse workloads in a single framework.

     

    📚 Timeline

  • 2009 — Spark is born at UC Berkeley's AMPLab (formerly RAD Lab) to overcome Hadoop's limitations.
  • 2010 — Spark is open-sourced for the first time.
  • 2013 — Spark is donated to the Apache Software Foundation (ASF); Databricks is founded by Spark's original creators to commercialize and support it.
  • 2014 — Apache Spark 1.0 is released, and Spark becomes an Apache top-level project.
  • 2015 — Spark becomes the most active Apache project; Spark 1.5 introduces Tungsten, bringing major optimizations for memory and CPU.
  • 2016 — Spark 2.0 is released with a unified DataFrame/Dataset API and Structured Streaming.
  • 2018 — Spark 2.3 adds continuous processing to Structured Streaming and native Kubernetes support.
  • 2020 — Spark 3.0 is released with Adaptive Query Execution (AQE), dynamic partition pruning, and better Python (Pandas) integration.
  • 2022 — Spark 3.2 and 3.3 are released, with enhanced features for streaming, connectors, and Python UDF support.
  • 2024+ — Spark continues evolving with a focus on cloud-native execution, Delta Lake integration, and ML/AI workloads.

    💡 Fun Facts

    Apache Spark was originally written in Scala and named “Spark” to signify lightning-fast processing.

    Its creators wanted a system that was not only powerful but also easy to use — that's why it offers APIs in Scala, Python, Java, R, and SQL!

    📚 Study Notes

    Spark’s Evolution at UC Berkeley

    •   In 2009, researchers at AMPLab (formerly RAD Lab) at UC Berkeley created Spark to overcome Hadoop's limitations.

    •   Spark introduced in-memory computing, simplified APIs, and faster iterative processing, making it 10–20x faster than MapReduce.

    •   By 2013, Spark had been donated to the Apache Software Foundation (ASF), and its creators launched Databricks, further driving its development.

    •   Apache Spark 1.0 was released in 2014, with continuous improvements contributed by Databricks and the open-source community.
