Reading CSV Files into DataFrames
Reading CSV files is one of the most common data ingestion tasks in Spark. Mastering the reader options lets you handle diverse CSV formats, from simple pipe-delimited files to complex CSVs with quoted fields and special characters, and turns raw text into clean, typed DataFrames ready for analysis.
Basic CSV Reading
Simplest Approach
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Read CSV with headers
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()
Result: a DataFrame in which every column is typed as string (no type inference is performed)
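Quick check (a sketch, assuming data.csv has id, name, age, and salary columns): every column comes back as string when inferSchema is not set.
df = spark.read.csv("data.csv", header=True)
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)
#  |-- salary: string (nullable = true)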
Schema Inference
Let Spark Infer Schema
df = spark .read .csv ( "data.csv" , header = True , inferSchema = True )
df .printSchema ()
# Output:
# root
# |– id: integer (nullable = true)
# |– name: string (nullable = true)
# |– age: integer (nullable = true)
# |– salary: double (nullable = true)
How it works: Spark makes an extra pass over the data (the entire file by default, or a sample when samplingRatio is set) and infers each column's type
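The production alternative (recommended in the best practices below) is to provide the schema explicitly so Spark skips the inference pass entirely; a minimal sketch, assuming the same four columns as above:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema: no extra inference pass over the data is needed
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
])

df = spark.read.csv("data.csv", header=True, schema=schema)
df.printSchema()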
CSV Options
Common Options
df = spark.read.csv(
    "data.csv",
    header=True,        # First row is header
    inferSchema=True,   # Infer column types
    sep=",",            # Column separator (default: comma)
    quote='"',          # Quote character
    escape="\\",        # Escape character
    multiLine=False,    # Allow values to span multiple lines
    nullValue="",       # String representing NULL
    emptyValue="",      # String representing an empty field
    encoding="utf-8"    # File encoding
)
df.show()
Different Delimiters
# Pipe-delimited
df_pipe = spark.read.csv("data.psv", header=True, sep="|", inferSchema=True)
# Tab-delimited
df_tab = spark.read.csv("data.tsv", header=True, sep="\t", inferSchema=True)
# Semicolon-delimited (common in European CSVs)
df_semi = spark.read.csv("data.csv", header=True, sep=";", inferSchema=True)
# Custom multi-character delimiter (requires Spark 3.0+)
df_custom = spark.read.csv("data.csv", header=True, sep="||", inferSchema=True)
Handling Special Cases
# File with quotes in values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    quote='"',
    escape='"'
)

# File with NULL represented as "NA"
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    nullValue="NA"
)

# File with multi-line values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    multiLine=True,
    quote='"'
)
Complete Practical Example
Sample CSV File
cat > employees.csv << 'EOF'
id,name,age,city,salary,department
1,Alice,28,New York,75000,Engineering
2,Bob,32,San Francisco,85000,Engineering
3,Charlie,25,New York,65000,Sales
4,Diana,35,Boston,90000,Engineering
5,Eve,29,San Francisco,80000,Sales
6,Frank,31,New York,88000,Engineering
7,Grace,26,Boston,70000,Marketing
8,Henry,33,San Francisco,95000,Engineering
EOF
Read and Analyze
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSVPractical") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Read CSV
df = spark.read.csv(
    "employees.csv",
    header=True,
    inferSchema=True
)

print("=== Schema ===")
df.printSchema()

print("\n=== First 5 rows ===")
df.show(5)

print("\n=== Data types ===")
print(df.dtypes)

print("\n=== Row count ===")
print(f"Total rows: {df.count()}")

print("\n=== Get specific column ===")
df.select("name", "salary").show()

print("\n=== Filter rows ===")
high_earners = df.filter(df.salary > 80000)
high_earners.show()

spark.stop()
Output:
=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department: string (nullable = true)

=== First 5 rows ===
+---+-------+---+-------------+------+-----------+
| id|   name|age|         city|salary| department|
+---+-------+---+-------------+------+-----------+
|  1|  Alice| 28|     New York| 75000|Engineering|
|  2|    Bob| 32|San Francisco| 85000|Engineering|
|  3|Charlie| 25|     New York| 65000|      Sales|
|  4|  Diana| 35|       Boston| 90000|Engineering|
|  5|    Eve| 29|San Francisco| 80000|      Sales|
+---+-------+---+-------------+------+-----------+
Reading from Different Sources
From HDFS
df = spark .read .csv ( " hdfs ://namenode:9000/data/employees.csv" , header = True , inferSchema = True )
From Azure
df = spark .read .csv ( " abfss ://employees.csv" , header = True , inferSchema = True )
From URL
# Spark cannot read http(s) URLs directly; one workaround is to download
# the file with addFile and read the local copy (reliable in local mode)
from pyspark import SparkFiles

spark.sparkContext.addFile("https://example.com/data/employees.csv")
df = spark.read.csv(
    "file://" + SparkFiles.get("employees.csv"),
    header=True,
    inferSchema=True
)
From Directory with Multiple Files
# Read all CSV files in a directory
df = spark.read.csv(
    "data_directory/*.csv",
    header=True,
    inferSchema=True
)
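The path argument also accepts an explicit list of files; a minimal sketch (the file names below are hypothetical):
# Read a specific list of CSV files into one DataFrame
df = spark.read.csv(
    ["data_directory/sales_jan.csv", "data_directory/sales_feb.csv"],
    header=True,
    inferSchema=True
)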
Performance Considerations
Sample Rows for Schema Inference
# By default, schema inference scans the entire file (samplingRatio=1.0)
# For large files, lower samplingRatio to infer types from a sample
df = spark.read.csv(
    "large_file.csv",
    header=True,
    inferSchema=True,
    samplingRatio=0.1  # Sample roughly 10% of the rows
)
Cache after Reading
df = spark .read .csv ( "data.csv" , header = True , inferSchema = True )
df .cache () # Keep in memory for repeated operations
# Trigger cache
df .count () # Forces evaluation
df .groupBy ( "department" ).count ( ).show () # Uses cached data
df .filter ( df .salary > 80000 ).show () # Uses cached data
💡 Best Practices
• Always use header=True if your CSV has headers (see the sketch after this list for what happens without it)
• Use inferSchema=True for initial exploration, then an explicit schema for production
• Specify nullValue if your file uses special NULL representations
• Enable multiLine=True only if needed (it has a performance cost)
• Test with small samples first before reading huge files
• Cache frequently-used DataFrames to avoid re-reading
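To illustrate the first bullet: without header=True, Spark treats the first row as data and assigns generic column names; a minimal sketch, assuming a hypothetical headerless employees_no_header.csv:
# Without header=True, columns are named _c0, _c1, ...
df = spark.read.csv("employees_no_header.csv", inferSchema=True)
# Assign meaningful names afterwards
df = df.toDF("id", "name", "age", "city", "salary", "department")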
📚 Study Notes
• Basic read: spark.read.csv(file, header=True)
• With schema inference: add inferSchema=True
• Different delimiters: use the sep parameter
• Special characters: configure quote and escape
• Multi-line values: use multiLine=True
• NULL representation: use the nullValue parameter
• Multiple files: use a wildcard such as *.csv
• Caching: call .cache() for repeated use