Installing Spark and Running the First Job

 

Your first Spark job is a critical milestone: it confirms that your installation is correct and takes you through the complete workflow from development to execution. Running it successfully validates your environment setup, demonstrates distributed data processing in action, and builds the confidence you need for further Spark development.

 

Prerequisites Verification

Before running your first job, verify everything is installed correctly:

Check Java Installation

java -version

# Output should show Java 8+
# Example: openjdk version "11.0.15"

echo $JAVA_HOME

# Should show your Java installation path
# Example: /usr/lib/jvm/java-11-openjdk-amd64

 

Check Spark Installation

echo $SPARK_HOME

# Should show your Spark installation path
# Example: /home/user/spark/spark-3.5.0-bin-hadoop3

spark-shell --version

# Should show Spark version
# Example: Spark 3.5.0

 

 

Check Python Installation (For PySpark)

python3 --version

# Should show Python 3.7+
# Example: Python 3.10.12

pip3 list | grep pyspark

# Should show pyspark installed (if installed via pip)
# Example: pyspark  3.5.0
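If all three checks pass, a quick way to confirm that Python can actually drive Spark is to start a throwaway local session. The following is a minimal sketch (the file name sanity_check.py and the local[1] master are illustrative choices, not part of the later walkthrough):

# sanity_check.py - confirm that PySpark can start a local session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")       # single local thread; no cluster required
    .appName("SanityCheck")
    .getOrCreate()
)

# Build a tiny DataFrame in memory and display it
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()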

 

 

 

Your First Spark Job: Complete Walkthrough

Step 1: Create Sample Data

Create a simple CSV file for our first job:

# Create a directory for our work
mkdir -p ~/spark-practicals
cd ~/spark-practicals

# Create sample data
cat > people.csv << 'EOF'
id,name,age,city,salary
1,Alice,28,New York,75000
2,Bob,32,San Francisco,85000
3,Charlie,25,New York,65000
4,Diana,35,Boston,90000
5,Eve,29,San Francisco,80000
6,Frank,31,New York,88000
7,Grace,26,Boston,70000
8,Henry,33,San Francisco,95000
EOF

# Verify the file was created
cat people.csv

 

 

Step 2: Create Your First Spark Script

Create a Python file first_job.py:

# first_job.py
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Log the Spark version
print(f"Spark Version: {spark.version}")

try:
    # Read CSV file into DataFrame
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Show the DataFrame
    print("\n=== Original Data ===")
    df.show()

    # Print schema
    print("\n=== Schema ===")
    df.printSchema()

    # Get row count
    print("\n=== Row Count ===")
    print(f"Total rows: {df.count()}")

    # Filter: ages greater than 28
    print("\n=== People aged > 28 ===")
    filtered_df = df.filter(df.age > 28)
    filtered_df.show()

    # Get statistics
    print("\n=== Salary Statistics ===")
    df.select("salary").describe().show()

    # Group by city
    print("\n=== Count by City ===")
    df.groupBy("city").count().show()

    print("\nJob completed successfully!")

except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

finally:
    # Always stop the Spark session
    spark.stop()
    print("Spark session stopped.")

 

 

Step 3: Run Your First Job

Option A: Using spark-submit (Recommended)

spark-submit first_job.py

# Output will show:
# Spark Version: 3.5.0
# === Original Data ===
# +---+-------+---+-------------+------+
# | id|   name|age|         city|salary|
# +---+-------+---+-------------+------+
# |  1|  Alice| 28|     New York| 75000|
# |  2|    Bob| 32|San Francisco| 85000|
# ...

 

 

Option B: Using python3 directly

python3 first_job.py

# Same output as above
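Running the script with python3 only works if the pyspark module is importable from that interpreter (for example, because it was installed with pip). If you are working from a standalone Spark download instead, one common workaround, assuming the optional findspark package is installed, is to point Python at SPARK_HOME before creating the session; a sketch:

# Optional: add these two lines at the very top of first_job.py if pyspark
# is not on your Python path (requires: pip3 install findspark)
import findspark
findspark.init()   # locates Spark via SPARK_HOME and adds it to sys.path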

 

 

 

Complete Job Output Explanation

When your job runs successfully, you'll see:

Spark Version: 3.5.0

=== Original Data ===
+---+-------+---+-------------+------+
| id|   name|age|         city|salary|
+---+-------+---+-------------+------+
|  1|  Alice| 28|     New York| 75000|
|  2|    Bob| 32|San Francisco| 85000|
|  3|Charlie| 25|     New York| 65000|
|  4|  Diana| 35|       Boston| 90000|
|  5|    Eve| 29|San Francisco| 80000|
|  6|  Frank| 31|     New York| 88000|
|  7|  Grace| 26|       Boston| 70000|
|  8|  Henry| 33|San Francisco| 95000|
+---+-------+---+-------------+------+

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)

=== Row Count ===
Total rows: 8

=== People aged > 28 ===
+---+-----+---+-------------+------+
| id| name|age|         city|salary|
+---+-----+---+-------------+------+
|  2|  Bob| 32|San Francisco| 85000|
|  4|Diana| 35|       Boston| 90000|
|  5|  Eve| 29|San Francisco| 80000|
|  6|Frank| 31|     New York| 88000|
|  8|Henry| 33|San Francisco| 95000|
+---+-----+---+-------------+------+

=== Salary Statistics ===
+-------+------------------+
|summary|            salary|
+-------+------------------+
|  count|                 8|
|   mean|           81000.0|
| stddev|10392.304845413264|
|    min|             65000|
|    max|             95000|
+-------+------------------+

=== Count by City ===
+-------------+-----+
|         city|count|
+-------------+-----+
|     New York|    3|
|       Boston|    2|
|San Francisco|    3|
+-------------+-----+

Job completed successfully!
Spark session stopped.

 

 

Understanding What Happened

What Your Job Did

  • Created SparkSession: Connected to Spark (running in local mode)
  • Read CSV file: Loaded people.csv into a DataFrame
  • Displayed data: Showed the 8 rows and 5 columns
  • Printed schema: Revealed column types (id=integer, name=string, etc.)
  • Counted rows: Showed a total of 8 records
  • Filtered data: Found people aged > 28 (5 results)
  • Computed statistics: Calculated salary count, mean, stddev, min, and max
  • Grouped data: Counted people by city
  • Stopped session: Cleaned up resources

Execution Flow

Start
  ↓
Create SparkSession (initialize Spark)
  ↓
Read people.csv (load into DataFrame)
  ↓
Operations (filter, count, groupBy, etc.)
  ↓
Show results (display to console)
  ↓
Stop SparkSession (cleanup)
  ↓
End
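If you want to see the plan Spark builds during the "Operations" step before any results are produced, DataFrame.explain() prints it to the console. A small sketch, assuming the df from first_job.py:

# Inspect the plan Spark will execute for a transformation
df.filter(df.age > 28).explain()   # prints the physical plan; no data is processed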

     

Common Issues and Fixes

Issue 1: "Spark command not found"

Error:

spark-submit: command not found

Fix:

# Ensure SPARK_HOME is set
echo $SPARK_HOME

# If empty, set it
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

# Try again
spark-submit first_job.py

     

     

Issue 2: "Java not found"

Error:

Error: JAVA_HOME is not set and could not be found

Fix:

# Find Java
which java
# /usr/bin/java

# Set JAVA_HOME (for Linux)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Verify
echo $JAVA_HOME

     

     

Issue 3: "File not found" when reading CSV

Error:

No such file or directory: people.csv

Fix:

# Make sure you're in the right directory
pwd  # Should show ~/spark-practicals

# Make sure people.csv exists
ls -la people.csv

# If running from a different directory, cd into it first
cd ~/spark-practicals && spark-submit first_job.py

# ...or use an absolute path to the CSV inside spark.read.csv() in the script

     

     

Issue 4: "Module not found: pyspark"

Error:

ModuleNotFoundError: No module named 'pyspark'

Fix:

# Install pyspark
pip3 install pyspark

# Or launch the PySpark shell bundled with your Spark installation
$SPARK_HOME/bin/pyspark
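After installing, a quick check that the module is importable (this prints the installed version, assuming the pip install succeeded):

# verify_pyspark.py - confirm the module can be imported
import pyspark
print(pyspark.__version__)   # e.g. 3.5.0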

     

     

     

Monitoring Your Job

Spark Web UI

While your job is running, you can monitor it:

# While the job (and its SparkSession) is running, the driver serves a web UI
# Open in browser: http://localhost:4040

# You can see:
# - Job status and progress
# - Stage information
# - Executor metrics
# - Memory and CPU usage
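The UI at port 4040 is only available while the SparkSession is alive, so a short local job may finish before you get a chance to look at it. One simple trick, shown here as a sketch you could add to first_job.py in place of the plain spark.stop() call, is to pause until you press Enter:

# Keep the session (and the web UI at http://localhost:4040) alive for inspection
input("Press Enter to stop the Spark session and close the UI...")
spark.stop()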

     

     

Console Output

The job prints progress to the console:

# Task execution starting
[Stage 0:>                                          (0 + 4) / 4]

# Tasks completing
[Stage 0:==============================>            (3 + 1) / 4]
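This progress bar is controlled by the spark.ui.showConsoleProgress setting. If you prefer quieter output for scripted runs, you can turn it off when building the session; a minimal sketch:

# Optional: build the session without the console progress bar
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .config("spark.ui.showConsoleProgress", "false") \
    .getOrCreate()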

     

     

     

Next Steps After Your First Job

Try These Enhancements

1. Add more operations:

# Add to your script
df_sorted = df.sort("salary", ascending=False)
print("Highest salaries first:")
df_sorted.show(3)

     

     

2. Save results:

# Save filtered results to a new file
filtered_df.write.csv("output/people_filtered", header=True)
print("Results saved to output/people_filtered")

     

     

3. Increase data volume:

# Create a larger CSV with more rows
python3 << 'EOF'
import random

names = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"]
cities = ["New York", "San Francisco", "Boston", "Chicago", "Denver"]

with open("people_large.csv", "w") as f:
    f.write("id,name,age,city,salary\n")
    for i in range(1, 1001):
        name = random.choice(names)
        age = random.randint(20, 65)
        city = random.choice(cities)
        salary = 50000 + random.randint(0, 50000)
        f.write(f"{i},{name},{age},{city},{salary}\n")

print("Created people_large.csv with 1000 rows")
EOF
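Once the larger file exists, you can point the same kind of job at it and add an aggregation; a brief sketch, assuming people_large.csv sits in the working directory and spark is an active session:

# Read the larger file and compute the average salary per city
df_large = spark.read.csv("people_large.csv", header=True, inferSchema=True)
df_large.groupBy("city").avg("salary").show()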

     

     

💡 Key Learnings from Your First Job

•   SparkSession is your entry point: All Spark operations start here
•   DataFrames are the data structure: They hold and manipulate your data
•   Operations are lazy: Nothing happens until you call an action such as show() or count() (see the sketch after this list)
•   Spark handles the distribution: Write code as if for a single machine; Spark distributes the work
•   Always stop your session: Releases resources when you are done
•   Error handling is important: try-except blocks protect against failures
•   Logging is helpful: Print statements help you debug and monitor
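The laziness mentioned above is easy to observe. In this sketch (reusing df from first_job.py), the filter call returns instantly because it only records what should happen; the actual work is done when count() runs:

# Transformations build a plan; actions trigger execution
high_earners = df.filter(df.salary > 80000)   # transformation: no data is read yet
print(high_earners.count())                   # action: Spark now runs the job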

     

     

📚 Study Notes

•   First job validates: Environment, installation, and basic functionality
•   SparkSession creation: .builder.appName().config().getOrCreate()
•   Reading CSV: spark.read.csv(filename, header=True, inferSchema=True)
•   Basic operations: .show(), .count(), .printSchema(), .filter(), .groupBy()
•   Always stop: spark.stop() to release resources
•   Error handling: Use the try-except-finally pattern for robustness
•   Execution modes: Local mode (default) vs cluster modes (YARN, Standalone, Kubernetes)
•   Spark Web UI: Available at localhost:4040 while the job runs
•   File paths: Can be local or distributed (HDFS, S3, etc.)

     

     
