Selecting and Renaming Columns

 

Selecting and renaming columns is one of the most fundamental data manipulation tasks in Spark. Mastering column selection lets you work with subsets of data, reorder columns, and create cleaner, more understandable DataFrames. This practical topic shows you multiple ways to select, rename, and restructure your data.

 

Basic Column Selection

Select Single Column

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SelectColumns").getOrCreate()

# Read data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Select a single column (returns a DataFrame)
df.select("name").show()

# Select by column object
df.select(col("name")).show()

 

# Both produce the same result:
# +-------+
# |   name|
# +-------+
# |  Alice|
# |    Bob|
# |Charlie|
# +-------+

 

 

 

Select Multiple Columns

# Select multiple columns (order matters)
df.select("name", "salary", "department").show()

# Using the col() function
df.select(col("name"), col("salary"), col("department")).show()

# Using a list
columns = ["name", "salary", "department"]
df.select(*columns).show()  # Note the * unpacking

 

# Result:
# +-------+------+----------+
# |   name|salary|department|
# +-------+------+----------+
# |  Alice| 75000|       Eng|
# |    Bob| 85000|       Eng|
# |Charlie| 65000|     Sales|
# +-------+------+----------+

 

 

 

Select with Expressions

Selecting with Calculations

# Select a column and calculate a new one
df.select(
    "name",
    col("salary"),
    (col("salary") * 1.1).alias("salary_with_raise")
).show()

 

# Result:
# +-----+------+-----------------+
# | name|salary|salary_with_raise|
# +-----+------+-----------------+
# |Alice| 75000|          82500.0|
# |  Bob| 85000|          93500.0|
# +-----+------+-----------------+

 

 

Select with Conditions

from pyspark.sql.functions import when

# Select with a conditional column
df.select(
    "name",
    "salary",
    when(col("salary") > 80000, "High")
    .when(col("salary") > 70000, "Medium")
    .otherwise("Low").alias("salary_range")
).show()
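
With the sample salaries shown earlier (75000, 85000, and 65000), the output would look roughly like this:

# +-------+------+------------+
# |   name|salary|salary_range|
# +-------+------+------------+
# |  Alice| 75000|      Medium|
# |    Bob| 85000|        High|
# |Charlie| 65000|         Low|
# +-------+------+------------+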

 

 

Renaming Columns

Method 1: withColumnRenamed()

# Rename a single column
df_renamed = df.withColumnRenamed("salary", "annual_salary")
df_renamed.printSchema()

# Rename multiple columns (chain the calls)
df_renamed = (
    df.withColumnRenamed("salary", "annual_salary")
      .withColumnRenamed("department", "dept")
)

df_renamed.show()
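
If you are on Spark 3.4 or later, withColumnsRenamed() collapses the chain into a single call that takes a dict of old-to-new names; a minimal sketch:

# Spark 3.4+ only: rename several columns in one call
df_renamed = df.withColumnsRenamed({"salary": "annual_salary", "department": "dept"})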

 

 

Method 2: Using select() with alias()

# Select with aliases (rename while selecting)
df_renamed = df.select(
    col("id"),
    col("name").alias("employee_name"),
    col("age").alias("years_old"),
    col("salary").alias("annual_salary"),
    col("department").alias("dept")
)

df_renamed.printSchema()
# root
#  |-- id: integer
#  |-- employee_name: string
#  |-- years_old: integer
#  |-- annual_salary: integer
#  |-- dept: string

 

 

Method 3: Bulk Rename

# Rename all columns at once
old_names = df.columns  # ["id", "name", "age", "salary", "department"]
new_names = ["emp_id", "emp_name", "emp_age", "emp_salary", "emp_dept"]

# Create a rename mapping
rename_mapping = dict(zip(old_names, new_names))

# Apply the renames
df_renamed = df.select([col(c).alias(rename_mapping[c]) for c in old_names])

df_renamed.show()
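
If the new names are already listed in column order, toDF() is a shorter alternative; a minimal sketch reusing the new_names list from above:

# toDF() replaces all column names positionally
df_renamed = df.toDF(*new_names)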

 

 

Select All Columns Plus New Ones

Select * with Additional Columns

# Keep all columns and add new ones
df_enhanced = df.select(
    "*",  # All original columns
    (col("salary") * 1.1).alias("new_salary"),
    when(col("salary") > 80000, "High").otherwise("Low").alias("salary_level")
)

df_enhanced.show()
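
The same result can also be written with SQL expression strings through selectExpr(); a minimal sketch equivalent to the select() above:

# selectExpr() takes SQL expression strings instead of Column objects
df_enhanced = df.selectExpr(
    "*",
    "salary * 1.1 AS new_salary",
    "CASE WHEN salary > 80000 THEN 'High' ELSE 'Low' END AS salary_level"
)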

 

 

 

Practical Examples

Example 1: Clean Messy Column Names

# Create a CSV with messy column names (run in the shell)
cat > messy_data.csv << 'EOF'
Employee ID,First Name,Last Name,Annual $alary,Department,Start Date
1,Alice,Smith,75000,Engineering,2020-01-15
2,Bob,Johnson,85000,Sales,2019-03-20
3,Charlie,Brown,65000,Marketing,2021-06-10
EOF

 

 

# Read the file and clean the names
df = spark.read.csv("messy_data.csv", header=True, inferSchema=True)

df_clean = df.select(
    col("Employee ID").alias("employee_id"),
    col("First Name").alias("first_name"),
    col("Last Name").alias("last_name"),
    col("Annual $alary").alias("annual_salary"),
    col("Department").alias("department"),
    col("Start Date").alias("start_date")
)

df_clean.show()
df_clean.printSchema()

 

 

Output:

+-----------+----------+---------+-------------+-----------+----------+
|employee_id|first_name|last_name|annual_salary| department|start_date|
+-----------+----------+---------+-------------+-----------+----------+
|          1|     Alice|    Smith|        75000|Engineering|2020-01-15|
|          2|       Bob|  Johnson|        85000|      Sales|2019-03-20|
|          3|   Charlie|    Brown|        65000|  Marketing|2021-06-10|
+-----------+----------+---------+-------------+-----------+----------+

 

 

Example 2: Reorder Columns

# Original order
df.columns  # ['id', 'name', 'age', 'salary', 'department']

# Reorder to a preferred order
df_reordered = df.select("id", "name", "department", "salary", "age")

df_reordered.show()

 

 

Example 3: Select Specific Column Types

# Select only numeric columns (df.dtypes reports long columns as 'bigint')
numeric_cols = [c for c, t in df.dtypes if t in ('int', 'bigint', 'double')]
df_numeric = df.select(numeric_cols)

# Select only string columns
string_cols = [c for c, t in df.dtypes if t == 'string']
df_strings = df.select(string_cols)

 

 

Example 4: Drop Columns

# Drop specific columns (the opposite of select)
df_dropped = df.select([c for c in df.columns if c != "age"])

# Or use the drop() method
df_dropped = df.drop("age", "ssn")  # Drop multiple columns

df_dropped.show()

 

 

 

Advanced Selection

Select with Regex

# Select columns matching a pattern with colRegex()
# Columns starting with "s"
df_selected = df.select(df.colRegex("`s.*`"))  # Backticks wrap the regex

# Columns ending with "id"
df_ids = df.select(df.colRegex("`.*id`"))
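
If you prefer to avoid the regex-quoting syntax, filtering df.columns with ordinary Python string methods gives the same result; a minimal sketch:

# Same idea with plain string matching on column names
df_selected = df.select([c for c in df.columns if c.startswith("s")])
df_ids = df.select([c for c in df.columns if c.endswith("id")])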

 

 

Select Multiple Columns Dynamically

# Select the top N columns by name, alphabetically
top_3_cols = sorted(df.columns)[:3]
df_selected = df.select(*top_3_cols)

 

 

Common Patterns

Pattern 1: Add Column Prefix/Suffix

# Add a prefix to every column
def add_prefix(df, prefix):
    return df.select([col(c).alias(f"{prefix}_{c}") for c in df.columns])

df_prefixed = add_prefix(df, "emp")
# Columns become: emp_id, emp_name, emp_age, ...

 

 

Pattern 2: Select Columns Except List

# Select all columns except specific ones
exclude_cols = ["ssn", "password", "internal_notes"]
selected_cols = [c for c in df.columns if c not in exclude_cols]
df_safe = df.select(*selected_cols)

 

 

Pattern 3: Rename for SQL Query

# Rename columns to be SQL-friendly
df_sql = df.select([col(c).alias(c.lower().replace(" ", "_")) for c in df.columns])

# Now the DataFrame can be queried with SQL directly
df_sql.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees WHERE salary > 80000").show()

 

 

💡 Best Practices

•   Use alias() when selecting: makes intent clear

•   Clean column names early, in the data ingestion phase (see the sketch after this list)

•   Use meaningful names: make code self-documenting

•   Avoid spaces in names: they cause SQL issues

•   Lowercase column names: a consistent convention

•   Use snake_case: the Pythonic naming convention

•   Document column meanings: in schemas or comments
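
Putting several of these practices together, here is a minimal sketch of a column-cleaning helper you could run right after ingestion (the clean_column_names name and the regex are illustrative assumptions, not a built-in API):

import re
from pyspark.sql.functions import col

def clean_column_names(df):
    # Lowercase, replace non-alphanumeric runs with "_", and trim stray underscores
    cleaned = []
    for c in df.columns:
        new_name = re.sub(r"[^0-9a-zA-Z]+", "_", c).strip("_").lower()
        cleaned.append(col(c).alias(new_name))
    return df.select(cleaned)

# e.g. "Employee ID" -> "employee_id", "Start Date" -> "start_date"
df_clean = clean_column_names(df)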

 

 

📚 Study Notes

•   select(): choose specific columns

•   alias(): rename a column inside select()

•   withColumnRenamed(): rename an existing column

•   Column references: string names or col() objects

•   drop(): remove columns (the opposite of select)

•   Expressions: select() can include calculations and conditions

•   Chaining: multiple renames can be chained

 

 

 
