Selecting and Renaming Columns
Selecting and renaming columns is one of the most fundamental data manipulation tasks in Spark. Mastering column selection lets you work with subsets of data, reorder columns, and create cleaner, more understandable DataFrames. This practical topic shows you multiple ways to select, rename, and restructure your data.
Basic Column Selection
Select Single Column
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SelectColumns").getOrCreate()

# Read data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Select a single column (returns a DataFrame)
df.select("name").show()

# Select by column object
df.select(col("name")).show()

# Both produce the same result:
# +-------+
# |   name|
# +-------+
# |  Alice|
# |    Bob|
# |Charlie|
# +-------+
Select Multiple Columns
# Select multiple columns (order matters)
df.select("name", "salary", "department").show()

# Using the col() function
df.select(col("name"), col("salary"), col("department")).show()

# Using a list
columns = ["name", "salary", "department"]
df.select(*columns).show()  # Note the * unpacking

# Result:
# +-------+------+-----------+
# |   name|salary| department|
# +-------+------+-----------+
# |  Alice| 75000|Engineering|
# |    Bob| 85000|Engineering|
# |Charlie| 65000|      Sales|
# +-------+------+-----------+
Select with Expressions
Selecting with Calculations
# Select columns and calculate a derived value
df.select(
    "name",
    col("salary"),
    (col("salary") * 1.1).alias("salary_with_raise")
).show()

# Result:
# +-----+------+-----------------+
# | name|salary|salary_with_raise|
# +-----+------+-----------------+
# |Alice| 75000|          82500.0|
# |  Bob| 85000|          93500.0|
# +-----+------+-----------------+
Select with Conditions
from pyspark.sql.functions import when

# Select with a conditional column
df.select(
    "name",
    "salary",
    when(col("salary") > 80000, "High")
    .when(col("salary") > 70000, "Medium")
    .otherwise("Low").alias("salary_range")
).show()
Renaming Columns
Method 1: withColumnRenamed()
# Rename a single column
df_renamed = df.withColumnRenamed("salary", "annual_salary")
df_renamed.printSchema()

# Rename multiple columns (chain the calls)
df_renamed = df.withColumnRenamed("salary", "annual_salary") \
    .withColumnRenamed("department", "dept")
df_renamed.show()
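If you are on Spark 3.4 or newer, chaining is not necessary: DataFrame.withColumnsRenamed() accepts a dict of old-to-new names in a single call. A minimal sketch, assuming the same df as above:

# Requires Spark 3.4+: rename several columns in one call
df_renamed = df.withColumnsRenamed({
    "salary": "annual_salary",
    "department": "dept"
})
df_renamed.printSchema()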
Method 2: Using select() with alias()
# Select with aliases (rename while selecting)
df_renamed = df.select(
    col("id"),
    col("name").alias("employee_name"),
    col("age").alias("years_old"),
    col("salary").alias("annual_salary"),
    col("department").alias("dept")
)
df_renamed.printSchema()
# root
#  |-- id: integer
#  |-- employee_name: string
#  |-- years_old: integer
#  |-- annual_salary: integer
#  |-- dept: string
Method 3: Bulk Rename
# Rename all columns at once
old_names = df.columns  # ["id", "name", "age", "salary", "department"]
new_names = ["emp_id", "emp_name", "emp_age", "emp_salary", "emp_dept"]

# Create a rename mapping
rename_mapping = dict(zip(old_names, new_names))

# Apply the renames
df_renamed = df.select([col(c).alias(rename_mapping[c]) for c in old_names])
df_renamed.show()
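When the new names line up positionally with the existing columns, toDF() is an even shorter way to bulk-rename. A small sketch, reusing the new_names list from above:

# toDF() replaces all column names positionally in one call
df_renamed = df.toDF(*new_names)
df_renamed.printSchema()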
Select All Columns Plus New Ones
Select * with Additional Columns
# Keep all original columns and add new ones
df_enhanced = df.select(
    "*",  # All original columns
    (col("salary") * 1.1).alias("new_salary"),
    when(col("salary") > 80000, "High").otherwise("Low").alias("salary_level")
)
df_enhanced.show()
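The same result can also be reached with withColumn(), which keeps every existing column and adds (or replaces) one derived column per call. A minimal sketch using the same expressions:

# withColumn() appends one column at a time while keeping all existing ones
df_enhanced = df.withColumn("new_salary", col("salary") * 1.1) \
    .withColumn("salary_level", when(col("salary") > 80000, "High").otherwise("Low"))
df_enhanced.show()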
Practical Examples
Example 1: Clean Messy Column Names
# Create a CSV with messy column names
cat > messy_data.csv << 'EOF'
Employee ID,First Name,Last Name,Annual $alary,Department,Start Date
1,Alice,Smith,75000,Engineering,2020-01-15
2,Bob,Johnson,85000,Sales,2019-03-20
3,Charlie,Brown,65000,Marketing,2021-06-10
EOF
# Read the data and clean up the names
df = spark.read.csv("messy_data.csv", header=True, inferSchema=True)
df_clean = df.select(
    col("Employee ID").alias("employee_id"),
    col("First Name").alias("first_name"),
    col("Last Name").alias("last_name"),
    col("Annual $alary").alias("annual_salary"),
    col("Department").alias("department"),
    col("Start Date").alias("start_date")
)
df_clean.show()
df_clean.printSchema()
Output:
+-----------+----------+---------+-------------+-----------+----------+
|employee_id|first_name|last_name|annual_salary| department|start_date|
+-----------+----------+---------+-------------+-----------+----------+
|          1|     Alice|    Smith|        75000|Engineering|2020-01-15|
|          2|       Bob|  Johnson|        85000|      Sales|2019-03-20|
|          3|   Charlie|    Brown|        65000|  Marketing|2021-06-10|
+-----------+----------+---------+-------------+-----------+----------+
Example 2: Reorder Columns
# Original order
df.columns  # ['id', 'name', 'age', 'salary', 'department']

# Reorder to the preferred order
df_reordered = df.select("id", "name", "department", "salary", "age")
df_reordered.show()
Example 3: Select Specific Column Types
# Select only numeric columns
# Note: df.dtypes reports LongType as 'bigint', not 'long'
numeric_cols = [c for c, t in df.dtypes if t in ('int', 'bigint', 'double', 'float')]
df_numeric = df.select(numeric_cols)

# Select only string columns
string_cols = [c for c, t in df.dtypes if t == 'string']
df_strings = df.select(string_cols)
Example 4: Drop Columns
# Drop specific columns (the opposite of select)
df_dropped = df.select([c for c in df.columns if c != "age"])

# Or use the drop() method
df_dropped = df.drop("age", "ssn")  # Drop multiple columns
df_dropped.show()
Advanced Selection
Select with Regex
# Select columns matching a regex pattern with DataFrame.colRegex()
# Columns whose names start with "s"
df_selected = df.select(df.colRegex("`s.*`"))  # Backticks wrap the regex

# Columns whose names end with "id"
df_ids = df.select(df.colRegex("`.*id`"))
Select Multiple Columns Dynamically
# Select the top N column names alphabetically
top_3_cols = sorted(df.columns)[:3]
df_selected = df.select(*top_3_cols)
Common Patterns
Pattern 1: Add Column Prefix/Suffix
# Add a prefix to all columns
def add_prefix(df, prefix):
    return df.select([col(c).alias(f"{prefix}_{c}") for c in df.columns])

df_prefixed = add_prefix(df, "emp")
# Columns become: emp_id, emp_name, emp_age, ...
Pattern 2: Select Columns Except List
# Select all columns except specific ones
exclude_cols = [ " ssn " , "password" , " internal_notes " ]
selected_cols = [ col for col in df .columns if col not in exclude_cols ]
df_safe = df .select (* selected_cols )
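The drop() method expresses the same intent more directly: it accepts multiple names and simply ignores any that are not present in the schema. A one-line equivalent:

# Equivalent: drop the excluded columns directly
df_safe = df.drop(*exclude_cols)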
Pattern 3: Rename for SQL Query
# Rename columns to be SQL-friendly
df_sql = df.select([col(c).alias(c.lower().replace(" ", "_")) for c in df.columns])

# Now the DataFrame can be used in SQL directly
df_sql.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees WHERE salary > 80000").show()
💡 Best Practices
• Use alias() when selecting: it makes intent clear
• Clean column names early, during data ingestion (see the sketch after this list)
• Use meaningful names: make code self-documenting
• Avoid spaces in names: they cause SQL issues
• Lowercase column names: a consistent convention
• Use snake_case: the Pythonic naming convention
• Document column meanings: in schemas or comments
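A small helper can apply several of these practices at once. A minimal sketch (the normalize_columns name and the exact character handling are assumptions, not a standard API):

import re

def normalize_columns(df):
    # Lowercase, snake_case, and strip characters that break SQL identifiers
    cleaned = [re.sub(r"[^0-9a-z]+", "_", c.lower()).strip("_") for c in df.columns]
    return df.toDF(*cleaned)

df_tidy = normalize_columns(df)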
📚 Study Notes
• select(): choose specific columns
• alias(): rename a column inside select()
• withColumnRenamed(): rename an existing column
• Column references: string names or col() objects
• drop(): remove columns (the opposite of select)
• Expressions: select() can include calculations and conditions
• Chaining: multiple renames can be chained