Creating a SparkSession in PySpark
Creating a SparkSession correctly is fundamental to every Spark application, because the session is your connection point to Spark's functionality. Understanding how to configure a SparkSession for different scenarios lets you optimize performance, control resource allocation, and adapt to various deployment environments. Mastering this configuration is what separates a beginner who just gets things working from an intermediate practitioner who tunes applications for specific needs.
Basic SparkSession Creation
The Simplest Form
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApplication") \
    .getOrCreate()

# Use the session
df = spark.read.csv("data.csv", header=True)
df.show()

# Always stop the session when the script is done
spark.stop()
What happens:
• SparkSession.builder returns a builder object that collects configuration options.
• .appName("MyApplication") sets the name that appears in the Spark UI.
• .getOrCreate() returns the active session if one already exists; otherwise it creates a new one.
• spark.stop() shuts down the underlying SparkContext and releases its resources.
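One Python detail worth calling out: the builder chain spans multiple lines, so it needs either backslash continuations (as above) or wrapping parentheses. A minimal sketch of the parenthesized style, which avoids problems with stray whitespace after a backslash:

from pyspark.sql import SparkSession

# Parentheses let the chain span lines without backslashes
spark = (
    SparkSession.builder
    .appName("MyApplication")
    .getOrCreate()
)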
SparkSession with Configuration
Memory Configuration
spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
Explanation:
• spark.driver.memory: memory for the driver process, which builds query plans and receives collected results (2g here).
• spark.executor.memory: memory for each executor process, where the actual data processing happens (4g here).
Example scenarios:
• Small exploratory work: the defaults are usually fine.
• Large collect() results or big broadcast variables: increase driver memory.
• Wide aggregations, caching, or memory-heavy transformations: increase executor memory.
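Note that JVM-level settings such as spark.driver.memory only take effect when the session is first created; if a session already exists, getOrCreate() returns it with its original settings. A quick sketch to confirm what was actually applied:

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Read the setting back to confirm it was applied
print(spark.conf.get("spark.driver.memory"))  # "2g"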
CPU Core Configuration
spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()
Explanation:
• spark.executor.cores: CPU cores available to each executor; each core runs one task at a time, so more cores means more parallel tasks per executor.
• spark.driver.cores: cores for the driver process (mainly relevant in cluster deploy mode).
Practical values:
• 2-5 cores per executor is a commonly cited sweet spot; far more than that tends to cause I/O contention within a single executor.
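As a rough capacity check, total parallelism is the number of executors times the cores per executor. The numbers below are illustrative assumptions, not recommendations for any particular cluster:

# Hypothetical cluster sizing, purely for illustration
num_executors = 10
cores_per_executor = 4

# Each core runs one task at a time, so this is the maximum
# number of tasks executing concurrently across the cluster
max_concurrent_tasks = num_executors * cores_per_executor
print(max_concurrent_tasks)  # 40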
Shuffle and Partition Configuration
spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()
Explanation:
• spark.sql.shuffle.partitions controls how many partitions Spark creates when it shuffles data for joins and aggregations.
• The default is 200, which is usually too many for small local datasets (overhead per tiny partition) and often too few for very large cluster jobs.
Example:
# For laptop development
spark = SparkSession.builder \
    .appName("Dev") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# For production cluster
spark = SparkSession.builder \
    .appName("Production") \
    .config("spark.sql.shuffle.partitions", "500") \
    .getOrCreate()
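Unlike memory settings, spark.sql.shuffle.partitions is a SQL runtime configuration, so it can also be changed on an existing session without recreating it. A minimal sketch:

# Adjust shuffle parallelism on a live session
spark.conf.set("spark.sql.shuffle.partitions", "50")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "50"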
Master Configuration
Local Mode (Development)
# Single core
spark = SparkSession.builder \
    .master("local") \
    .appName("MyApp") \
    .getOrCreate()

# Multiple cores (e.g. 4)
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("MyApp") \
    .getOrCreate()

# All available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyApp") \
    .getOrCreate()
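To confirm how many cores the local master actually picked up, you can inspect the context's default parallelism (the output will vary by machine):

# Reports the parallelism available to the scheduler,
# e.g. the number of cores under local[*]
print(spark.sparkContext.defaultParallelism)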
Standalone Cluster
spark = SparkSession.builder \
    .master("spark://master-hostname:7077") \
    .appName("MyApp") \
    .getOrCreate()
YARN Cluster (Hadoop)
spark = SparkSession.builder \
    .master("yarn") \
    .appName("MyApp") \
    .getOrCreate()
Kubernetes
spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .getOrCreate()
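On Kubernetes, Spark also needs to know which container image to run executors in, set via spark.kubernetes.container.image. A minimal sketch; the image name and namespace below are placeholder assumptions:

spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.5.0") \
    .config("spark.kubernetes.namespace", "data-jobs") \
    .getOrCreate()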
Complete Practical Example
Development SparkSession
from pyspark.sql import SparkSession

def create_dev_spark():
    """Create a Spark session for development"""
    spark = SparkSession.builder \
        .appName("DevelopmentApp") \
        .master("local") \
        .config("spark.driver.memory", "2g") \
        .config("spark.sql.shuffle.partitions", "4") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()
    return spark

spark = create_dev_spark()
print(f"Spark Version: {spark.version}")
print(f"App Name: {spark.sparkContext.appName}")

# Use it
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()
Production SparkSession
def create_prod_spark():
    """Create a Spark session for production"""
    spark = SparkSession.builder \
        .appName("ProductionApp") \
        .master("yarn") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.memory", "8g") \
        .config("spark.executor.cores", "4") \
        .config("spark.sql.shuffle.partitions", "500") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.dynamicAllocation.minExecutors", "5") \
        .config("spark.dynamicAllocation.maxExecutors", "100") \
        .getOrCreate()
    return spark

spark = create_prod_spark()
# Production processing
spark.stop()
Getting Information from SparkSession
Accessing Configuration
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Get a specific config value
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"Shuffle partitions: {shuffle_partitions}")

# Get all configs
all_configs = spark.sparkContext.getConf().getAll()
for key, value in all_configs:
    print(f"{key} = {value}")

# Get app name
app_name = spark.sparkContext.appName
print(f"App name: {app_name}")

# Get Spark version
version = spark.version
print(f"Spark version: {version}")

# Get master
master = spark.sparkContext.master
print(f"Master: {master}")
Common Patterns
Pattern 1: Function to Create Session
def get_spark_session(app_name, master=None, memory="2g", cores=4):
    """Generic function to create a SparkSession"""
    builder = SparkSession.builder.appName(app_name)
    if master:
        builder = builder.master(master)
    builder = builder \
        .config("spark.driver.memory", memory) \
        .config("spark.executor.cores", cores)
    return builder.getOrCreate()

# Usage
spark = get_spark_session("MyApp", master="local[*]", memory="2g", cores=4)
Pattern 2: Jupyter Notebook
# A SparkSession often exists automatically in Jupyter,
# but you can create one explicitly to be safe
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NotebookApp") \
    .config("spark.sql.shuffle.partitions", "50") \
    .getOrCreate()

# Work with data
df = spark.read.csv("data.csv", header=True)
df.show()  # use df.display() in Databricks notebooks

# No need to stop in notebooks (the session persists for the notebook's lifetime)
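If you want to check whether the environment already provides a session before building one, SparkSession.getActiveSession() (Spark 3.0+) returns the current session or None. A minimal sketch:

from pyspark.sql import SparkSession

existing = SparkSession.getActiveSession()
if existing is None:
    # No session yet -- build one
    existing = SparkSession.builder.appName("NotebookApp").getOrCreate()
print(existing.sparkContext.appName)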
Pattern 3: Production Job
# main_job.py
import sys
from pyspark.sql import SparkSession

def main():
    # Create the session; job name taken from the first CLI argument
    spark = SparkSession.builder \
        .appName(f"Job_{sys.argv[1]}") \
        .getOrCreate()
    try:
        # Do the work
        df = spark.read.parquet("input/")
        result = df.filter(df.age > 25)
        result.write.parquet("output/")
        print("Job succeeded")
    except Exception as e:
        print(f"Job failed: {e}")
        sys.exit(1)
    finally:
        spark.stop()

if __name__ == "__main__":
    main()
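In batch jobs it is also common to tame Spark's console logging right after the session is created; setLogLevel is part of the standard SparkContext API:

# Reduce log noise so job output is easier to read
spark.sparkContext.setLogLevel("WARN")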
💡 Best Practices
1. Always use the builder pattern: cleaner and more flexible than creating a SparkContext directly
2. Set appName: makes jobs identifiable in the Spark UI
3. Configure based on environment: dev and prod need different settings
4. Always call stop(): especially in scripts; less critical in notebooks
5. Handle exceptions: wrap work in try/except/finally, as in Pattern 3 above
6. Don't create multiple sessions: use getOrCreate() to reuse the existing one (see the sketch below)
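To see point 6 in action: getOrCreate() returns the already-running session instead of building a second one, so repeated calls are safe:

from pyspark.sql import SparkSession

spark_a = SparkSession.builder.appName("First").getOrCreate()
spark_b = SparkSession.builder.appName("Second").getOrCreate()

# Both names refer to the same underlying session
print(spark_a is spark_b)             # True
print(spark_b.sparkContext.appName)   # "First" -- the original name sticks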
📚 Study Notes
• SparkSession.builder: entry point for creating sessions
• .appName(): name your application (appears in the Spark UI)
• .master(): set the execution mode (local, yarn, spark://, k8s://)
• .config(): add configuration settings
• .getOrCreate(): get an existing session or create a new one
• .stop(): shut down and release resources
• Dev config: local mode, small memory, small partition count
• Prod config: YARN/K8s, large memory, large partition count, dynamic allocation
• Access config: spark.conf.get(), spark.sparkContext.getConf()