Installing Spark and Running the First Job

 

Your first Spark job is a critical milestone: it confirms that your installation is correct and takes you through the complete workflow from development to execution. Running it successfully validates your environment setup, demonstrates distributed data processing in action, and builds the confidence you need for further Spark development.

 

Prerequisites Verification

Before running your first job, verify everything is installed correctly:

Check Java Installation

java -version

# Output should show Java 8+
# Example: openjdk version "11.0.15"

echo $JAVA_HOME

# Should show your Java installation path
# Example: /usr/lib/jvm/java-11-openjdk-amd64

 

Check Spark Installation

echo $SPARK_HOME

# Should show your Spark installation path
# Example: /home/user/spark/spark-3.5.0-bin-hadoop3

spark-shell --version

# Should show Spark version
# Example: Spark 3.5.0

 

 

Check Python Installation (For PySpark)

python3 --version

# Should show Python 3.7+
# Example: Python 3.10.12

pip3 list | grep pyspark

# Should show pyspark installed (if installed via pip)
# Example: pyspark  3.5.0
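If all three checks pass, a quick way to confirm that Python can actually drive Spark is to start a throwaway local session. The following is a minimal sketch (the file name sanity_check.py and the local[1] master are illustrative choices, not part of the later walkthrough):

# sanity_check.py - confirm that PySpark can start a local session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")       # single local thread; no cluster required
    .appName("SanityCheck")
    .getOrCreate()
)

# Build a tiny DataFrame in memory and display it
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()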

 

 

 

Your First Spark Job: Complete Walkthrough

Step 1: Create Sample Data

Create a simple CSV file for our first job:

# Create a directory for our work
mkdir -p ~/spark-practicals
cd ~/spark-practicals

# Create sample data
cat > people.csv << 'EOF'
id,name,age,city,salary
1,Alice,28,New York,75000
2,Bob,32,San Francisco,85000
3,Charlie,25,New York,65000
4,Diana,35,Boston,90000
5,Eve,29,San Francisco,80000
6,Frank,31,New York,88000
7,Grace,26,Boston,70000
8,Henry,33,San Francisco,95000
EOF

# Verify the file was created
cat people.csv

 

 

Step 2: Create Your First Spark Script

Create a Python file first_job.py:

# first_job.py
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Log the Spark version
print(f"Spark Version: {spark.version}")

try:
    # Read CSV file into DataFrame
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Show the DataFrame
    print("\n=== Original Data ===")
    df.show()

    # Print schema
    print("\n=== Schema ===")
    df.printSchema()

    # Get row count
    print("\n=== Row Count ===")
    print(f"Total rows: {df.count()}")

    # Filter: ages greater than 28
    print("\n=== People aged > 28 ===")
    filtered_df = df.filter(df.age > 28)
    filtered_df.show()

    # Get statistics
    print("\n=== Salary Statistics ===")
    df.select("salary").describe().show()

    # Group by city
    print("\n=== Count by City ===")
    df.groupBy("city").count().show()

    print("\nJob completed successfully!")

except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

finally:
    # Always stop the Spark session
    spark.stop()
    print("Spark session stopped.")

 

 

Step 3: Run Your First Job

Option A: Using spark-submit (Recommended)

spark-submit first_job.py

# Output will show:
# Spark Version: 3.5.0
# === Original Data ===
# +---+-------+---+-------------+------+
# | id|   name|age|         city|salary|
# +---+-------+---+-------------+------+
# |  1|  Alice| 28|     New York| 75000|
# |  2|    Bob| 32|San Francisco| 85000|
# ...

 

 

Option B: Using python3 directly

python3 first_job.py

# Same output as above
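Running the script with python3 only works if the pyspark module is importable from that interpreter (for example, because it was installed with pip). If you are working from a standalone Spark download instead, one common workaround, assuming the optional findspark package is installed, is to point Python at SPARK_HOME before creating the session; a sketch:

# Optional: add these two lines at the very top of first_job.py if pyspark
# is not on your Python path (requires: pip3 install findspark)
import findspark
findspark.init()   # locates Spark via SPARK_HOME and adds it to sys.path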

 

 

 

Complete Job Output Explanation

When your job runs successfully, you'll see:

Spark Version: 3.5.0

=== Original Data ===
+---+-------+---+-------------+------+
| id|   name|age|         city|salary|
+---+-------+---+-------------+------+
|  1|  Alice| 28|     New York| 75000|
|  2|    Bob| 32|San Francisco| 85000|
|  3|Charlie| 25|     New York| 65000|
|  4|  Diana| 35|       Boston| 90000|
|  5|    Eve| 29|San Francisco| 80000|
|  6|  Frank| 31|     New York| 88000|
|  7|  Grace| 26|       Boston| 70000|
|  8|  Henry| 33|San Francisco| 95000|
+---+-------+---+-------------+------+

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)

=== Row Count ===
Total rows: 8

=== People aged > 28 ===
+---+-----+---+-------------+------+
| id| name|age|         city|salary|
+---+-----+---+-------------+------+
|  2|  Bob| 32|San Francisco| 85000|
|  4|Diana| 35|       Boston| 90000|
|  5|  Eve| 29|San Francisco| 80000|
|  6|Frank| 31|     New York| 88000|
|  8|Henry| 33|San Francisco| 95000|
+---+-----+---+-------------+------+

=== Salary Statistics ===
+-------+------------------+
|summary|            salary|
+-------+------------------+
|  count|                 8|
|   mean|           81000.0|
| stddev|10392.304845413264|
|    min|             65000|
|    max|             95000|
+-------+------------------+

=== Count by City ===
+-------------+-----+
|         city|count|
+-------------+-----+
|     New York|    3|
|       Boston|    2|
|San Francisco|    3|
+-------------+-----+

Job completed successfully!
Spark session stopped.

 

 

Understanding What Happened

What Your Job Did

  • Created SparkSession: Connected to Spark (running in local mode)
  • Read CSV file: Loaded people.csv into a DataFrame
  • Displayed data: Showed the 8 rows and 5 columns
  • Printed schema: Revealed column types (id=integer, name=string, etc.)
  • Counted rows: Showed a total of 8 records
  • Filtered data: Found people aged > 28 (5 results)
  • Computed statistics: Calculated salary count, mean, stddev, min, and max
  • Grouped data: Counted people by city
  • Stopped session: Cleaned up resources

Execution Flow

Start
  ↓
Create SparkSession (initialize Spark)
  ↓
Read people.csv (load into DataFrame)
  ↓
Operations (filter, count, groupBy, etc.)
  ↓
Show results (display to console)
  ↓
Stop SparkSession (cleanup)
  ↓
End
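If you want to see the plan Spark builds during the "Operations" step before any results are produced, DataFrame.explain() prints it to the console. A small sketch, assuming the df from first_job.py:

# Inspect the plan Spark will execute for a transformation
df.filter(df.age > 28).explain()   # prints the physical plan; no data is processed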

     

Common Issues and Fixes

Issue 1: "Spark command not found"

Error:

spark-submit: command not found

Fix:

# Ensure SPARK_HOME is set
echo $SPARK_HOME

# If empty, set it
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

# Try again
spark-submit first_job.py

     

     

Issue 2: "Java not found"

Error:

Error: JAVA_HOME is not set and could not be found

Fix:

# Find Java
which java
# /usr/bin/java

# Set JAVA_HOME (for Linux)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Verify
echo $JAVA_HOME

     

     

Issue 3: "File not found" when reading CSV

Error:

No such file or directory: people.csv

Fix:

# Make sure you're in the right directory
pwd  # Should show ~/spark-practicals

# Make sure people.csv exists
ls -la people.csv

# If running from a different directory, cd into it first
cd ~/spark-practicals && spark-submit first_job.py

# ...or use an absolute path to the CSV inside spark.read.csv() in the script

     

     

Issue 4: "Module not found: pyspark"

Error:

ModuleNotFoundError: No module named 'pyspark'

Fix:

# Install pyspark
pip3 install pyspark

# Or launch the PySpark shell bundled with your Spark installation
$SPARK_HOME/bin/pyspark
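After installing, a quick check that the module is importable (this prints the installed version, assuming the pip install succeeded):

# verify_pyspark.py - confirm the module can be imported
import pyspark
print(pyspark.__version__)   # e.g. 3.5.0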

     

     

     

Monitoring Your Job

Spark Web UI

While your job is running, you can monitor it:

# While the job (and its SparkSession) is running, the driver serves a web UI
# Open in browser: http://localhost:4040

# You can see:
# - Job status and progress
# - Stage information
# - Executor metrics
# - Memory and CPU usage
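The UI at port 4040 is only available while the SparkSession is alive, so a short local job may finish before you get a chance to look at it. One simple trick, shown here as a sketch you could add to first_job.py in place of the plain spark.stop() call, is to pause until you press Enter:

# Keep the session (and the web UI at http://localhost:4040) alive for inspection
input("Press Enter to stop the Spark session and close the UI...")
spark.stop()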

     

     

Console Output

The job prints progress to the console:

# Task execution starting
[Stage 0:>                                          (0 + 4) / 4]

# Tasks completing
[Stage 0:==============================>            (3 + 1) / 4]
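This progress bar is controlled by the spark.ui.showConsoleProgress setting. If you prefer quieter output for scripted runs, you can turn it off when building the session; a minimal sketch:

# Optional: build the session without the console progress bar
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FirstSparkJob") \
    .config("spark.ui.showConsoleProgress", "false") \
    .getOrCreate()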

     

     

     

Next Steps After Your First Job

Try These Enhancements

1. Add more operations:

# Add to your script
df_sorted = df.sort("salary", ascending=False)
print("Highest salaries first:")
df_sorted.show(3)

     

     

2. Save results:

# Save filtered results to a new file
filtered_df.write.csv("output/people_filtered", header=True)
print("Results saved to output/people_filtered")

     

     

3. Increase data volume:

# Create a larger CSV with more rows
python3 << 'EOF'
import random

names = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"]
cities = ["New York", "San Francisco", "Boston", "Chicago", "Denver"]

with open("people_large.csv", "w") as f:
    f.write("id,name,age,city,salary\n")
    for i in range(1, 1001):
        name = random.choice(names)
        age = random.randint(20, 65)
        city = random.choice(cities)
        salary = 50000 + random.randint(0, 50000)
        f.write(f"{i},{name},{age},{city},{salary}\n")

print("Created people_large.csv with 1000 rows")
EOF
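Once the larger file exists, you can point the same kind of job at it and add an aggregation; a brief sketch, assuming people_large.csv sits in the working directory and spark is an active session:

# Read the larger file and compute the average salary per city
df_large = spark.read.csv("people_large.csv", header=True, inferSchema=True)
df_large.groupBy("city").avg("salary").show()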

     

     

💡 Key Learnings from Your First Job

•   SparkSession is your entry point: All Spark operations start here
•   DataFrames are the data structure: They hold and manipulate your data
•   Operations are lazy: Nothing happens until you call an action such as show() or count() (see the sketch after this list)
•   Spark handles the distribution: Write code as if for a single machine; Spark distributes the work
•   Always stop your session: Releases resources when you are done
•   Error handling is important: try-except blocks protect against failures
•   Logging is helpful: Print statements help you debug and monitor
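The laziness mentioned above is easy to observe. In this sketch (reusing df from first_job.py), the filter call returns instantly because it only records what should happen; the actual work is done when count() runs:

# Transformations build a plan; actions trigger execution
high_earners = df.filter(df.salary > 80000)   # transformation: no data is read yet
print(high_earners.count())                   # action: Spark now runs the job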

     

     

📚 Study Notes

•   First job validates: Environment, installation, and basic functionality
•   SparkSession creation: .builder.appName().config().getOrCreate()
•   Reading CSV: spark.read.csv(filename, header=True, inferSchema=True)
•   Basic operations: .show(), .count(), .printSchema(), .filter(), .groupBy()
•   Always stop: spark.stop() to release resources
•   Error handling: Use the try-except-finally pattern for robustness
•   Execution modes: Local mode (default) vs cluster modes (YARN, Standalone, Kubernetes)
•   Spark Web UI: Available at localhost:4040 while the job runs
•   File paths: Can be local or distributed (HDFS, S3, etc.)

     

     
