Installing Spark and Running the First Job
Your first Spark job is a critical milestone: it confirms that your installation is correct and walks you through the complete workflow from development to execution. Running it successfully validates your environment setup, demonstrates distributed data processing in action, and builds the hands-on confidence you need for further Spark development.
Prerequisites Verification
Before running your first job, verify everything is installed correctly:
Check Java Installation
java -version
# Output should show Java 8+
# Example: openjdk version "11.0.15"

echo $JAVA_HOME
# Should show your Java installation path
# Example: /usr/lib/jvm/java-11-openjdk-amd64
Check Spark Installation
echo $SPARK_HOME
# Should show your Spark installation path
# Example: /home/user/spark/spark-3.5.0-bin-hadoop3

spark-shell --version
# Should show Spark version
# Example: Spark 3.5.0
Check Python Installation (For PySpark)
python3 --version
# Should show Python 3.7+
# Example: Python 3.10.12
pip3 list | grep pyspark
# Should show pyspark installed (if installed via pip)
# Example: pyspark 3.5.0
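If you installed PySpark with pip, a quick import from Python confirms the package is actually usable. A minimal sketch (the script name check_pyspark.py is just an example, not part of the walkthrough):

# check_pyspark.py - sanity check that PySpark is importable (assumes a pip install)
import pyspark

print(pyspark.__version__)   # e.g. 3.5.0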
Your First Spark Job: Complete Walkthrough
Step 1: Create Sample Data
Create a simple CSV file for our first job:
# Create a directory for our work
mkdir -p ~/spark-practicals
cd ~/spark-practicals

# Create sample data
cat > people.csv << 'EOF'
id,name,age,city,salary
1,Alice,28,New York,75000
2,Bob,32,San Francisco,85000
3,Charlie,25,New York,65000
4,Diana,35,Boston,90000
5,Eve,29,San Francisco,80000
6,Frank,31,New York,88000
7,Grace,26,Boston,70000
8,Henry,33,San Francisco,95000
EOF

# Verify the file was created
cat people.csv
Step 2: Create Your First Spark Script
Create a Python file first_job.py:
# first_job.py
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Log the Spark version
print(f"Spark Version: {spark.version}")

try:
    # Read CSV file into DataFrame
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Show the DataFrame
    print("\n=== Original Data ===")
    df.show()

    # Print schema
    print("\n=== Schema ===")
    df.printSchema()

    # Get row count
    print("\n=== Row Count ===")
    print(f"Total rows: {df.count()}")

    # Filter: ages greater than 28
    print("\n=== People aged > 28 ===")
    filtered_df = df.filter(df.age > 28)
    filtered_df.show()

    # Get statistics
    print("\n=== Salary Statistics ===")
    df.select("salary").describe().show()

    # Group by city
    print("\n=== Count by City ===")
    df.groupBy("city").count().show()

    print("\n✅ Job completed successfully!")

except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

finally:
    # Always stop the Spark session
    spark.stop()
    print("Spark session stopped.")
Step 3: Run Your First Job
Option A: Using spark-submit (Recommended)
spark-submit first_job.py

# Output will show:
# Spark Version: 3.5.0
# === Original Data ===
# +---+-------+---+-------------+------+
# | id|   name|age|         city|salary|
# +---+-------+---+-------------+------+
# |  1|  Alice| 28|     New York| 75000|
# |  2|    Bob| 32|San Francisco| 85000|
# ...
Option B: Using python3 directly
python3 first_job.py
# Same output as above
Complete Job Output Explanation
When your job runs successfully, you'll see:
Spark Version: 3.5.0

=== Original Data ===
+---+-------+---+-------------+------+
| id|   name|age|         city|salary|
+---+-------+---+-------------+------+
|  1|  Alice| 28|     New York| 75000|
|  2|    Bob| 32|San Francisco| 85000|
|  3|Charlie| 25|     New York| 65000|
|  4|  Diana| 35|       Boston| 90000|
|  5|    Eve| 29|San Francisco| 80000|
|  6|  Frank| 31|     New York| 88000|
|  7|  Grace| 26|       Boston| 70000|
|  8|  Henry| 33|San Francisco| 95000|
+---+-------+---+-------------+------+

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)

=== Row Count ===
Total rows: 8

=== People aged > 28 ===
+---+-----+---+-------------+------+
| id| name|age|         city|salary|
+---+-----+---+-------------+------+
|  2|  Bob| 32|San Francisco| 85000|
|  4|Diana| 35|       Boston| 90000|
|  5|  Eve| 29|San Francisco| 80000|
|  6|Frank| 31|     New York| 88000|
|  8|Henry| 33|San Francisco| 95000|
+---+-----+---+-------------+------+
=== Salary Statistics ===
+-------+-------------+
|summary|       salary|
+-------+-------------+
|  count|            8|
|   mean|      81000.0|
| stddev|10392.3048...|
|    min|        65000|
|    max|        95000|
+-------+-------------+
=== Count by City ===
+-------------+-----+
|         city|count|
+-------------+-----+
|     New York|    3|
|       Boston|    2|
|San Francisco|    3|
+-------------+-----+
✅ Job completed successfully!
Spark session stopped.
Understanding What Happened
What Your Job Did
1. Created a SparkSession named "FirstSparkJob" with 4 shuffle partitions
2. Read people.csv into a DataFrame, letting Spark infer the schema
3. Displayed the data, its schema, and the row count
4. Filtered the rows where age > 28
5. Computed summary statistics on the salary column
6. Grouped the rows by city and counted them
7. Stopped the SparkSession to release resources
Execution Flow
Start
↓
Create SparkSession (initialize Spark)
↓
Read people.csv (load into a DataFrame)
↓
Operations (filter, count, groupBy, etc.)
↓
Show results (display to console)
↓
Stop SparkSession (cleanup)
↓
End
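A key detail hidden in this flow: transformations such as filter() and groupBy() are lazy, so Spark only builds an execution plan until an action such as show() or count() forces it to run. A minimal sketch illustrating the difference, assuming the df DataFrame from first_job.py (high_earners is just an illustrative name):

# This line returns immediately - no data is read, Spark only records the plan
high_earners = df.filter(df.salary > 80000)

# Actions trigger real work: the CSV is scanned and the filter applied only now
print(high_earners.count())   # 4 with the sample people.csv

# explain() prints the physical plan Spark built for this computation
high_earners.explain()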
Common Issues and Fixes
Issue 1: "Spark command not found"
Error:
spark-submit: command not found
Fix:
# Ensure SPARK_HOME is set
echo $SPARK_HOME

# If empty, set it
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

# Try again
spark-submit first_job.py
Issue 2: "Java not found"
Error:
Error: JAVA_HOME is not set and could not be found
Fix:
# Find Java
which java
# /usr/bin/java

# Set JAVA_HOME (for Linux)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Verify
echo $JAVA_HOME
Issue 3: "File not found" when reading CSV
Error:
No such file or directory: people.csv
Fix:
# Make sure you're in the right directory
pwd  # Should show ~/spark-practicals

# Make sure people.csv exists
ls -la people.csv

# If you run the job from a different directory, cd back first
# (or use the file's absolute path in the script, as shown below)
cd ~/spark-practicals && spark-submit first_job.py
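Alternatively, make the script independent of the working directory by building an absolute path inside first_job.py. A sketch, assuming the file was created under ~/spark-practicals (csv_path is just an illustrative name):

import os

# Resolve people.csv relative to the home directory instead of the
# current working directory; adjust the path if you created it elsewhere
csv_path = os.path.expanduser("~/spark-practicals/people.csv")
df = spark.read.csv(csv_path, header=True, inferSchema=True)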
Issue 4: "Module not found: pyspark"
Error:
ModuleNotFoundError: No module named 'pyspark'
Fix:
# Install pyspark
pip3 install pyspark

# Or use the PySpark shell bundled with your Spark installation
$SPARK_HOME/bin/pyspark
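If you would rather reuse the PySpark that ships with your Spark download instead of a pip-installed copy, the third-party findspark package is a common workaround (an assumption here: it is installed with pip3 install findspark and SPARK_HOME is set):

# Requires: pip3 install findspark, and SPARK_HOME pointing at your Spark directory
import findspark
findspark.init()   # adds Spark's bundled PySpark to sys.path

import pyspark     # now importable without a pip-installed pyspark
print(pyspark.__version__)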
Monitoring Your Job
Spark Web UI
While your job is running, you can monitor it:
# Your job automatically starts a Spark Web UI
# Open it in a browser: http://localhost:4040

# You can see:
# - Job status and progress
# - Stage information
# - Executor metrics
# - Memory and CPU usage
Console Output
The job prints progress to console:
# Task execution starting
[Stage 0:>                    (0 + 4) / 4]

# Progressing toward completion
[Stage 0:===============>     (3 + 1) / 4]
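The UI at http://localhost:4040 disappears as soon as spark.stop() runs, so a short job may finish before you can open it. One simple trick (a sketch, not part of first_job.py as written) is to pause the script before stopping the session:

# Keep the Spark UI at http://localhost:4040 alive until you press Enter,
# then stop the session to release resources
input("Job done - press Enter to stop the Spark session...")
spark.stop()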
Next Steps After Your First Job
Try These Enhancements
1. Add more operations:
# Add to your script
df_sorted = df.sort("salary", ascending=False)
print("Highest salaries first:")
df_sorted.show(3)
2. Save results:
# Save filtered results to a new output folder
filtered_df.write.csv("output/people_filtered", header=True)
# Note: re-running this fails if the folder already exists;
# see the sketch after this list for mode("overwrite")
print("Results saved to output/people_filtered")
3. Increase data volume:
# Create a larger CSV with more rows
python3 << 'EOF'
import random

names = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"]
cities = ["New York", "San Francisco", "Boston", "Chicago", "Denver"]

with open("people_large.csv", "w") as f:
    f.write("id,name,age,city,salary\n")
    for i in range(1, 1001):
        name = random.choice(names)
        age = random.randint(20, 65)
        city = random.choice(cities)
        salary = 50000 + random.randint(0, 50000)
        f.write(f"{i},{name},{age},{city},{salary}\n")

print("Created people_large.csv with 1000 rows")
EOF
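Once people_large.csv exists, you can point the same operations at it and persist an aggregate. The sketch below (df_large, city_counts, and the output path are just illustrative names) also shows mode("overwrite"), which lets you re-run a save, such as the one in enhancement 2, without a "path already exists" error:

# Re-run the analysis on the larger file (inside a script with an active SparkSession)
df_large = spark.read.csv("people_large.csv", header=True, inferSchema=True)
print(f"Rows: {df_large.count()}")   # 1000

city_counts = df_large.groupBy("city").count()
city_counts.show()

# mode("overwrite") replaces any previous output instead of raising an error
city_counts.write.mode("overwrite").csv("output/city_counts", header=True)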
💡 Key Learnings from Your First Job
• SparkSession is your entry point: All Spark operations start here
• DataFrames are the data structure: Hold and manipulate your data
• Operations are lazy: Nothing happens until you call show(), count(), etc.
• Spark handles the distribution: Write code for a single machine, Spark distributes
• Always stop your session: Releases resources when done
• Error handling is important: try-except blocks protect against failures
• Logging is helpful: Print statements help debug and monitor
📚 Study Notes
• First job validates: Environment, installation, and basic functionality
• SparkSession creation: SparkSession.builder.appName(...).config(...).getOrCreate()
• Reading CSV: spark.read.csv(filename, header=True, inferSchema=True)
• Basic operations: .show(), .count(), .printSchema(), .filter(), .groupBy()
• Always stop: spark.stop() to release resources
• Error handling: Use the try-except-finally pattern for robustness
• Execution modes: Local mode (default) vs cluster modes (YARN, Standalone, K8s)
• Spark Web UI: Available at localhost:4040 while job runs
• File paths: Can be local or distributed (HDFS, S3, etc.)