Creating a SparkSession in PySpark

 

Creating a SparkSession correctly is fundamental to every Spark application: it is your connection point to Spark's functionality. Understanding how to configure a SparkSession for different scenarios lets you optimize performance, control resource allocation, and adapt to various deployment environments. Mastering this configuration is what moves you from a beginner who just gets things working to an intermediate practitioner who tunes applications for specific needs.

 

Basic SparkSession Creation

The Simplest Form

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApplication") \
    .getOrCreate()

# Use spark
df = spark.read.csv("data.csv", header=True)
df.show()

# Always stop when done
spark.stop()

What happens:

  • .builder: Starts the builder pattern
  • .appName(): Names your application (appears in the Spark UI)
  • .getOrCreate(): Returns the existing session if one exists, otherwise creates a new one (see the sketch below)
  • .stop(): Releases all resources
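Because .getOrCreate() reuses any active session, calling the builder twice returns the very same object, and the settings from the first call (including its appName) are the ones that stick. A minimal sketch to confirm this behavior (the app names are just illustrative):

from pyspark.sql import SparkSession

spark1 = SparkSession.builder.appName("ReuseDemo").getOrCreate()
spark2 = SparkSession.builder.appName("SomethingElse").getOrCreate()

# Both variables point to the same session; the first appName wins
print(spark1 is spark2)              # True
print(spark2.sparkContext.appName)   # ReuseDemo

spark1.stop()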

SparkSession with Configuration

Memory Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

Explanation:

  • spark.driver.memory: How much RAM the Driver JVM gets (default 1g); both values can be read back at runtime, as shown below
  • spark.executor.memory: How much RAM each Executor JVM gets (default 1g)
  • Example scenarios:
      • Large dataset: increase to 8g or more
      • Limited laptop: keep at 1-2g
      • Production cluster: 8-16g per executor
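As a quick sanity check (a minimal sketch reusing the configuration above), you can read the memory settings back from the session's underlying SparkConf:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Read the values back from the SparkConf the session was built with
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory"))    # 2g
print(conf.get("spark.executor.memory"))  # 4g

spark.stop()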

CPU Core Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()

Explanation:

  • spark.executor.cores: Number of CPU cores per executor (see the local-mode note below)
  • spark.driver.cores: Number of CPU cores for the driver
  • Practical values:
      • Laptop (4 cores total): executor=2, driver=1
      • Server (16 cores): executor=4, driver=2
      • Large cluster: executor=8, driver=4
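Note that in local mode these two settings are largely ignored: the number of usable cores comes from the master string instead (e.g. local[4]). A minimal sketch to check how many task slots you actually have, assuming an illustrative local[4] master:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("CoreCheck") \
    .getOrCreate()

# In local[N] mode this reports N task slots
print(spark.sparkContext.defaultParallelism)  # 4

spark.stop()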

Shuffle and Partition Configuration

spark = SparkSession.builder \
    .appName("DataProcessing") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

Explanation:

  • spark.sql.shuffle.partitions: Number of partitions after shuffle operations such as joins and groupBy (a verification sketch follows the examples)
  • The default of 200 is tuned for large clusters
  • Reduce to 50-100 for small data or laptop development
  • Increase to 500+ for massive datasets
  • Example:

# For laptop development
spark = SparkSession.builder \
    .appName("Dev") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# For a production cluster
spark = SparkSession.builder \
    .appName("Production") \
    .config("spark.sql.shuffle.partitions", "500") \
    .getOrCreate()
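To see the setting take effect, count the partitions of a shuffled result. A minimal sketch (the tiny in-memory DataFrame is just for illustration):

from pyspark.sql import SparkSession

# Adaptive execution is disabled here so the partition count is exactly what we set;
# with AQE on, Spark may coalesce small shuffle outputs into fewer partitions.
spark = SparkSession.builder \
    .appName("ShuffleDemo") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.sql.adaptive.enabled", "false") \
    .getOrCreate()

df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["id", "key"])

# groupBy triggers a shuffle, so the result lands in the configured number of partitions
grouped = df.groupBy("key").count()
print(grouped.rdd.getNumPartitions())  # 4

spark.stop()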

     

Master Configuration

Local Mode (Development)

# Single core
spark = SparkSession.builder \
    .master("local") \
    .appName("MyApp") \
    .getOrCreate()

# Multiple cores (here, 4)
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("MyApp") \
    .getOrCreate()

# All available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyApp") \
    .getOrCreate()

Standalone Cluster

spark = SparkSession.builder \
    .master("spark://master-hostname:7077") \
    .appName("MyApp") \
    .getOrCreate()

YARN Cluster (Hadoop)

spark = SparkSession.builder \
    .master("yarn") \
    .appName("MyApp") \
    .getOrCreate()

Kubernetes

spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .getOrCreate()
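On Kubernetes the master URL alone is usually not enough: Spark also needs to know which container image to launch executors in. A minimal sketch of the extra setting (the API server address and image name are placeholders):

spark = SparkSession.builder \
    .master("k8s://https://kubernetes-api:443") \
    .appName("MyApp") \
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.5.0") \
    .getOrCreate()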

     

Complete Practical Example

Development SparkSession

from pyspark.sql import SparkSession

def create_dev_spark():
    """Create a Spark session for development."""
    spark = SparkSession.builder \
        .appName("DevelopmentApp") \
        .master("local") \
        .config("spark.driver.memory", "2g") \
        .config("spark.sql.shuffle.partitions", "4") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()
    return spark

spark = create_dev_spark()
print(f"Spark Version: {spark.version}")
print(f"App Name: {spark.sparkContext.appName}")

# Use it
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()

Production SparkSession

def create_prod_spark():
    """Create a Spark session for production."""
    spark = SparkSession.builder \
        .appName("ProductionApp") \
        .master("yarn") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.memory", "8g") \
        .config("spark.executor.cores", "4") \
        .config("spark.sql.shuffle.partitions", "500") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.dynamicAllocation.minExecutors", "5") \
        .config("spark.dynamicAllocation.maxExecutors", "100") \
        .getOrCreate()
    return spark

spark = create_prod_spark()
# Production processing
spark.stop()
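One caveat on the dynamic allocation settings above: Spark also needs a way to keep shuffle data available when executors are removed, typically either the external shuffle service on the cluster nodes or, on Spark 3.0+, shuffle tracking. A sketch of the extra setting (your cluster's requirements may differ):

# Dynamic allocation needs shuffle data to outlive executors. Two common options:
#   - external shuffle service on the nodes: spark.shuffle.service.enabled=true
#   - Spark 3.0+ shuffle tracking (no extra service), shown here
spark = SparkSession.builder \
    .appName("ProductionApp") \
    .master("yarn") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()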

     

Getting Information from SparkSession

Accessing Configuration

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Get a specific config
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
print(f"Shuffle partitions: {shuffle_partitions}")

# Get all configs
all_configs = spark.sparkContext.getConf().getAll()
for key, value in all_configs:
    print(f"{key} = {value}")

# Get app name
app_name = spark.sparkContext.appName
print(f"App name: {app_name}")

# Get Spark version
version = spark.version
print(f"Spark version: {version}")

# Get master
master = spark.sparkContext.master
print(f"Master: {master}")
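Configuration is not only readable at runtime. SQL settings such as the shuffle partition count can also be changed on a live session with spark.conf.set, while core resource settings (executor memory, cores) are fixed once the session is running. A minimal sketch:

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Adjust a SQL setting on the fly, e.g. before a large join
spark.conf.set("spark.sql.shuffle.partitions", "50")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 50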

     

Common Patterns

Pattern 1: Function to Create Session

def get_spark_session(app_name, master=None, memory="2g", cores=4):
    """Generic function to create a SparkSession."""
    builder = SparkSession.builder.appName(app_name)

    if master:
        builder = builder.master(master)

    builder = builder \
        .config("spark.driver.memory", memory) \
        .config("spark.executor.cores", str(cores))

    return builder.getOrCreate()

# Usage
spark = get_spark_session("MyApp", master="local[*]", memory="2g", cores=4)
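A natural extension of this pattern (a sketch, not part of the original function) is to accept an optional dictionary of extra settings, so callers can pass environment-specific configuration without changing the function:

def get_spark_session_ex(app_name, master=None, extra_configs=None):
    """Create a SparkSession, applying any extra config key/value pairs."""
    builder = SparkSession.builder.appName(app_name)
    if master:
        builder = builder.master(master)
    # Apply caller-supplied settings, if any
    for key, value in (extra_configs or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Usage
spark = get_spark_session_ex(
    "MyApp",
    master="local[*]",
    extra_configs={"spark.sql.shuffle.partitions": "8"},
)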

Pattern 2: Jupyter Notebook

# In managed notebook environments (e.g., Databricks) a SparkSession often
# already exists, but creating one explicitly with getOrCreate() is safe either way

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NotebookApp") \
    .config("spark.sql.shuffle.partitions", "50") \
    .getOrCreate()

# Work with data
df = spark.read.csv("data.csv", header=True)
df.show()  # in Databricks notebooks you can use display(df) instead

# No need to stop in notebooks (the session persists for the notebook's lifetime)
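If you want to check whether a session is already active before building your own, PySpark 3.0+ provides SparkSession.getActiveSession(). A minimal sketch:

from pyspark.sql import SparkSession

existing = SparkSession.getActiveSession()  # returns None if no session is active
if existing is not None:
    spark = existing
else:
    spark = SparkSession.builder.appName("NotebookApp").getOrCreate()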

Pattern 3: Production Job

# main_job.py
import sys
from pyspark.sql import SparkSession

def main():
    # Create session; label the job with the first CLI argument if one is given
    job_label = sys.argv[1] if len(sys.argv) > 1 else "default"
    spark = SparkSession.builder \
        .appName(f"Job_{job_label}") \
        .getOrCreate()

    try:
        # Do work
        df = spark.read.parquet("input/")
        result = df.filter(df.age > 25)
        result.write.parquet("output/")
        print("Job succeeded")

    except Exception as e:
        print(f"Job failed: {e}")
        sys.exit(1)

    finally:
        spark.stop()

if __name__ == "__main__":
    main()

💡 Best Practices

1.  Always use the builder pattern: cleaner and more flexible than creating a SparkContext directly

2.  Set appName: makes jobs identifiable in the Spark UI

3.  Configure based on environment: dev and prod need different settings

4.  Always call stop(): especially in scripts; less critical in notebooks

5.  Handle exceptions: wrap work in try/except/finally

6.  Don't create multiple sessions: use getOrCreate() to reuse the existing one

📚 Study Notes

  • SparkSession.builder: Entry point for creating sessions
  • .appName(): Name your application (appears in the Spark UI)
  • .master(): Set execution mode (local, yarn, spark://, k8s://)
  • .config(): Add configuration settings
  • .getOrCreate(): Get the existing session or create a new one
  • .stop(): Shut down and release resources
  • Dev config: local mode, small memory, small partition count
  • Prod config: YARN/K8s, large memory, large partition count, dynamic allocation
  • Access config: spark.conf.get(), spark.sparkContext.getConf()
