History of Apache Spark
Before Spark, there was MapReduce—a powerful but clunky model pioneered by Google and later implemented in Apache Hadoop. It worked, but had a fatal flaw: it was slow.
Why? Because every step required disk I/O: each MapReduce job wrote its intermediate results to disk, and the next job had to read them back in.
Imagine cooking a meal, but having to go back to the fridge for every single ingredient.
In 2009, a group of researchers at UC Berkeley’s AMPLab had had enough. They created Apache Spark to fix this.
It introduced:
• In-memory computing: intermediate results stay in RAM instead of being written to disk between steps (illustrated in the sketch just below).
• Simplified, high-level APIs that replace hand-written MapReduce jobs.
• Fast iterative processing, a major win for machine learning and graph algorithms that pass over the same data many times.
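To make the in-memory point concrete, here is a minimal Scala sketch (Spark’s original language) of an iterative job: the dataset is loaded and cached once, then reused across iterations without touching disk. The file name `points.txt`, the app name, and the `local[*]` master are illustrative assumptions, not details from the original text.

```scala
import org.apache.spark.sql.SparkSession

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real cluster would use a different master.
    val spark = SparkSession.builder
      .appName("CachingSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load once and keep in memory. In the MapReduce model, each pass
    // would re-read this data from disk.
    val numbers = sc.textFile("points.txt") // hypothetical input: one number per line
      .map(_.toDouble)
      .cache()

    // An iterative job (here, a repeated aggregation) reuses the cached data.
    for (i <- 1 to 10) {
      val sum = numbers.reduce(_ + _)
      println(s"iteration $i: sum = $sum")
    }

    spark.stop()
  }
}
```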
By 2013, Spark joined the Apache Software Foundation. Soon after, its creators launched Databricks, which still leads Spark’s development today.
Spark quickly became the de facto big data processing engine, unifying diverse workloads in a single framework.
📚 Timeline
• 2009 — Spark created at UC Berkeley’s AMPLab.
• 2010 — Spark open-sourced.
• 2013 — Donated to the Apache Software Foundation; Databricks founded.
• 2014 — Spark becomes a top-level Apache project; Spark 1.0 released.
💡 Fun Facts
Apache Spark was originally written in Scala and named “Spark” to signify lightning-fast processing.
Its creators wanted a system that was not only powerful but also easy to use — that's why it offers APIs in Scala, Python, Java, R, and SQL!
📚 Study Notes
Spark’s Evolution at UC Berkeley
• In 2009, researchers at AMPLab (successor to the RAD Lab) at UC Berkeley created Spark to overcome the limitations of Hadoop MapReduce.
• Spark introduced in-memory computing, simplified APIs, and faster iterative processing, making it 10–20x faster than MapReduce on many workloads (see the word-count sketch after this list).
• By 2013, Spark was donated to the Apache Software Foundation (ASF), and its creators launched Databricks, which continues to drive its development.
• Apache Spark 1.0 was released in 2014, with continuous improvements contributed by Databricks and the open-source community.
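To illustrate the “simplified APIs” point above, here is a hedged Scala sketch of word count, a job that classically takes dozens of lines of MapReduce boilerplate but fits in a handful of Spark lines. The input path `input.txt` is a hypothetical placeholder.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WordCountSketch")
      .master("local[*]") // local mode for illustration
      .getOrCreate()

    // The whole pipeline: read, split into words, count by key.
    val counts = spark.sparkContext
      .textFile("input.txt")          // hypothetical input file
      .flatMap(_.split("\\s+"))       // one record per word
      .map(word => (word, 1))         // pair each word with a count of 1
      .reduceByKey(_ + _)             // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The equivalent Hadoop MapReduce program would require separate Mapper and Reducer classes plus job-configuration code, which is exactly the contrast the study notes draw.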