Creating a SparkSession in PySpark

 

Creating a SparkSession correctly is fundamental to every Spark application: it is your connection point to Spark's functionality. Understanding how to configure a SparkSession for different scenarios lets you optimize performance, control resource allocation, and adapt to various deployment environments. Mastering this configuration is what moves you from a beginner who just gets things working to an intermediate practitioner who tunes applications for specific needs.

 

Basic SparkSession Creation

The Simplest Form

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApplication") \
    .getOrCreate()

# Use spark
df = spark.read.csv("data.csv", header=True)
df.show()

# Always stop when done
spark.stop()

What happens:

  • .builder: Starts the builder pattern
  • .appName(): Names your application (appears in the Spark UI)
  • .getOrCreate(): Returns the existing session if one exists, otherwise creates a new one (see the sketch below)
  • .stop(): Releases all resources
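Because .getOrCreate() reuses any active session, calling the builder twice returns the very same object, and the settings from the first call (including its appName) are the ones that stick. A minimal sketch to confirm this behavior (the app names are just illustrative):

from pyspark.sql import SparkSession

spark1 = SparkSession.builder.appName("ReuseDemo").getOrCreate()
spark2 = SparkSession.builder.appName("SomethingElse").getOrCreate()

# Both variables point to the same session; the first appName wins
print(spark1 is spark2)              # True
print(spark2.sparkContext.appName)   # ReuseDemo

spark1.stop()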

SparkSession with Configuration

Memory Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

Explanation:

  • spark.driver.memory: How much RAM the Driver JVM gets (default 1g); both values can be read back at runtime, as shown below
  • spark.executor.memory: How much RAM each Executor JVM gets (default 1g)
  • Example scenarios:
      • Large dataset: increase to 8g or more
      • Limited laptop: keep at 1-2g
      • Production cluster: 8-16g per executor
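As a quick sanity check (a minimal sketch reusing the configuration above), you can read the memory settings back from the session's underlying SparkConf:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Read the values back from the SparkConf the session was built with
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory"))    # 2g
print(conf.get("spark.executor.memory"))  # 4g

spark.stop()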

CPU Core Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()

Explanation:

  • spark.executor.cores: Number of CPU cores per executor (see the local-mode note below)
  • spark.driver.cores: Number of CPU cores for the driver
  • Practical values:
      • Laptop (4 cores total): executor=2, driver=1
      • Server (16 cores): executor=4, driver=2
      • Large cluster: executor=8, driver=4
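Note that in local mode these two settings are largely ignored: the number of usable cores comes from the master string instead (e.g. local[4]). A minimal sketch to check how many task slots you actually have, assuming an illustrative local[4] master:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("CoreCheck") \
    .getOrCreate()

# In local[N] mode this reports N task slots
print(spark.sparkContext.defaultParallelism)  # 4

spark.stop()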

Shuffle and Partition Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

Explanation:

  • spark.sql.shuffle.partitions: Number of partitions after shuffle operations such as joins and groupBy (a verification sketch follows the examples)
  • The default of 200 is tuned for large clusters
  • Reduce to 50-100 for small data or laptop development
  • Increase to 500+ for massive datasets
  • Example:

# For laptop development
spark = SparkSession.builder \
    .appName("Dev") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# For a production cluster
spark = SparkSession.builder \
    .appName("Production") \
    .config("spark.sql.shuffle.partitions", "500") \
    .getOrCreate()
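To see the setting take effect, count the partitions of a shuffled result. A minimal sketch (the tiny in-memory DataFrame is just for illustration):

from pyspark.sql import SparkSession

# Adaptive execution is disabled here so the partition count is exactly what we set;
# with AQE on, Spark may coalesce small shuffle outputs into fewer partitions.
spark = SparkSession.builder \
    .appName("ShuffleDemo") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.sql.adaptive.enabled", "false") \
    .getOrCreate()

df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["id", "key"])

# groupBy triggers a shuffle, so the result lands in the configured number of partitions
grouped = df.groupBy("key").count()
print(grouped.rdd.getNumPartitions())  # 4

spark.stop()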

     

Master Configuration

Local Mode (Development)

# Single core
spark = SparkSession.builder \
    .master("local") \
    .appName("MyApp") \
    .getOrCreate()

# Multiple cores (here, 4)
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("MyApp") \
    .getOrCreate()

# All available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyApp") \
    .getOrCreate()

Standalone Cluster

spark = SparkSession.builder \
    .master("spark://master-hostname:7077") \
    .appName("MyApp") \
    .getOrCreate()

YARN Cluster (Hadoop)

spark = SparkSession.builder \
    .master("yarn") \
    .appName("MyApp") \
    .getOrCreate()

Kubernetes

spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .getOrCreate()
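On Kubernetes the master URL alone is usually not enough: Spark also needs to know which container image to launch executors in. A minimal sketch of the extra setting (the API server address and image name are placeholders):

spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.5.0") \
    .getOrCreate()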

     

Complete Practical Example

Development SparkSession

from pyspark.sql import SparkSession

def create_dev_spark():
    """Create a Spark session for development."""
    spark = SparkSession.builder \
        .appName("DevelopmentApp") \
        .master("local") \
        .config("spark.driver.memory", "2g") \
        .config("spark.sql.shuffle.partitions", "4") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()
    return spark

spark = create_dev_spark()
print(f"Spark Version: {spark.version}")
print(f"App Name: {spark.sparkContext.appName}")

# Use it
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()

Production SparkSession

def create_prod_spark():
    """Create a Spark session for production."""
    spark = SparkSession.builder \
        .appName("ProductionApp") \
        .master("yarn") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.memory", "8g") \
        .config("spark.executor.cores", "4") \
        .config("spark.sql.shuffle.partitions", "500") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.dynamicAllocation.minExecutors", "5") \
        .config("spark.dynamicAllocation.maxExecutors", "100") \
        .getOrCreate()
    return spark

spark = create_prod_spark()
# Production processing
spark.stop()
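One caveat on the dynamic allocation settings above: Spark also needs a way to keep shuffle data available when executors are removed, typically either the external shuffle service on the cluster nodes or, on Spark 3.0+, shuffle tracking. A sketch of the extra setting (your cluster's requirements may differ):

# Dynamic allocation needs shuffle data to outlive executors. Two common options:
#   - external shuffle service on the nodes: spark.shuffle.service.enabled=true
#   - Spark 3.0+ shuffle tracking (no extra service), shown here
spark = SparkSession.builder \
    .appName("ProductionApp") \
    .master("yarn") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()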

     

Getting Information from SparkSession

Accessing Configuration

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Get a specific config
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"Shuffle partitions: {shuffle_partitions}")

# Get all configs
all_configs = spark.sparkContext.getConf().getAll()
for key, value in all_configs:
    print(f"{key} = {value}")

# Get app name
app_name = spark.sparkContext.appName
print(f"App name: {app_name}")

# Get Spark version
version = spark.version
print(f"Spark version: {version}")

# Get master
master = spark.sparkContext.master
print(f"Master: {master}")
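Configuration is not only readable at runtime. SQL settings such as the shuffle partition count can also be changed on a live session with spark.conf.set, while core resource settings (executor memory, cores) are fixed once the session is running. A minimal sketch:

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Adjust a SQL setting on the fly, e.g. before a large join
spark.conf.set("spark.sql.shuffle.partitions", "50")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 50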

     

Common Patterns

Pattern 1: Function to Create Session

def get_spark_session(app_name, master=None, memory="2g", cores=4):
    """Generic function to create a SparkSession."""
    builder = SparkSession.builder.appName(app_name)

    if master:
        builder = builder.master(master)

    builder = builder \
        .config("spark.driver.memory", memory) \
        .config("spark.executor.cores", str(cores))

    return builder.getOrCreate()

# Usage
spark = get_spark_session("MyApp", master="local[*]", memory="2g", cores=4)
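A natural extension of this pattern (a sketch, not part of the original function) is to accept an optional dictionary of extra settings, so callers can pass environment-specific configuration without changing the function:

def get_spark_session_ex(app_name, master=None, extra_configs=None):
    """Create a SparkSession, applying any extra config key/value pairs."""
    builder = SparkSession.builder.appName(app_name)
    if master:
        builder = builder.master(master)
    # Apply caller-supplied settings, if any
    for key, value in (extra_configs or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Usage
spark = get_spark_session_ex(
    "MyApp",
    master="local[*]",
    extra_configs={"spark.sql.shuffle.partitions": "8"},
)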

Pattern 2: Jupyter Notebook

# In managed notebook environments (e.g., Databricks) a SparkSession often
# already exists, but creating one explicitly with getOrCreate() is safe either way

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NotebookApp") \
    .config("spark.sql.shuffle.partitions", "50") \
    .getOrCreate()

# Work with data
df = spark.read.csv("data.csv", header=True)
df.show()  # in Databricks notebooks you can use display(df) instead

# No need to stop in notebooks (the session persists for the notebook's lifetime)
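If you want to check whether a session is already active before building your own, PySpark 3.0+ provides SparkSession.getActiveSession(). A minimal sketch:

from pyspark.sql import SparkSession

existing = SparkSession.getActiveSession()  # returns None if no session is active
if existing is not None:
    spark = existing
else:
    spark = SparkSession.builder.appName("NotebookApp").getOrCreate()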

Pattern 3: Production Job

# main_job.py
import sys
from pyspark.sql import SparkSession

def main():
    # Create session; label the job with the first CLI argument if one is given
    job_label = sys.argv[1] if len(sys.argv) > 1 else "default"
    spark = SparkSession.builder \
        .appName(f"Job_{job_label}") \
        .getOrCreate()

    try:
        # Do work
        df = spark.read.parquet("input/")
        result = df.filter(df.age > 25)
        result.write.parquet("output/")
        print("Job succeeded")

    except Exception as e:
        print(f"Job failed: {e}")
        sys.exit(1)

    finally:
        spark.stop()

if __name__ == "__main__":
    main()

💡 Best Practices

1.  Always use the builder pattern: cleaner and more flexible than creating a SparkContext directly

2.  Set appName: makes jobs identifiable in the Spark UI

3.  Configure based on environment: dev and prod need different settings

4.  Always call stop(): especially in scripts; less critical in notebooks

5.  Handle exceptions: wrap work in try/except/finally

6.  Don't create multiple sessions: use getOrCreate() to reuse the existing one

📚 Study Notes

  • SparkSession.builder: Entry point for creating sessions
  • .appName(): Name your application (appears in the Spark UI)
  • .master(): Set execution mode (local, yarn, spark://, k8s://)
  • .config(): Add configuration settings
  • .getOrCreate(): Get the existing session or create a new one
  • .stop(): Shut down and release resources
  • Dev config: local mode, small memory, small partition count
  • Prod config: YARN/K8s, large memory, large partition count, dynamic allocation
  • Access config: spark.conf.get(), spark.sparkContext.getConf()
