Reading CSV Files into DataFrames

 

Reading CSV files is one of the most common data ingestion tasks in Spark. Mastering the reader's options lets you handle diverse CSV formats, from simple pipe-delimited files to complex CSVs with quoted fields and special characters. Knowing how to configure the CSV reader turns raw text files into clean, typed DataFrames ready for analysis.

 

Basic CSV Reading

Simplest Approach

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Read CSV with headers
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()

Result: a DataFrame in which every column has type string (without inferSchema, Spark does not infer types and leaves all columns as strings)
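For example, with the employees.csv sample used later in this article, printSchema() would report every column as string (a sketch, assuming that file):

df = spark.read.csv("employees.csv", header=True)  # no inferSchema
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)
#  |-- city: string (nullable = true)
#  |-- salary: string (nullable = true)
#  |-- department: string (nullable = true)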

Schema Inference

Let Spark Infer Schema

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()

 

# Output:
# root
#  |-- id: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
#  |-- salary: double (nullable = true)

How it works: with inferSchema enabled, Spark makes an extra pass over the data to infer each column's type; by default it reads every row (samplingRatio defaults to 1.0)
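If inference guesses a type you don't want (for example, a numeric-looking ID that should stay a string), you can cast the column after reading. A minimal sketch, assuming a hypothetical id column:

from pyspark.sql.functions import col

# Override an inferred type after reading (hypothetical id column)
df = df.withColumn("id", col("id").cast("string"))
df.printSchema()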

CSV Options

Common Options

df = spark.read.csv(
    "data.csv",
    header=True,          # First row is header
    inferSchema=True,     # Infer column types
    sep=",",              # Column separator (default: comma)
    quote='"',            # Quote character
    escape="\\",          # Escape character (default: backslash)
    multiLine=False,      # Allow values to span multiple lines
    nullValue="",         # String representing NULL
    emptyValue="",        # String representing an empty field
    encoding="utf-8"      # File encoding
)

df.show()

Different Delimiters

# Pipe-delimited
df_pipe = spark.read.csv("data.psv", header=True, sep="|", inferSchema=True)

# Tab-delimited
df_tab = spark.read.csv("data.tsv", header=True, sep="\t", inferSchema=True)

# Semicolon-delimited (European CSV)
df_semi = spark.read.csv("data.csv", header=True, sep=";", inferSchema=True)

# Custom multi-character delimiter (supported in Spark 3.0+)
df_custom = spark.read.csv("data.csv", header=True, sep="||", inferSchema=True)

Handling Special Cases

# File with quotes in values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    quote='"',
    escape='"'
)

# File with NULL represented as "NA"
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    nullValue="NA"
)

# File with multi-line values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    multiLine=True,
    quote='"'
)

 

Complete Practical Example

Sample CSV File

cat > employees.csv << 'EOF'
id,name,age,city,salary,department
1,Alice,28,New York,75000,Engineering
2,Bob,32,San Francisco,85000,Engineering
3,Charlie,25,New York,65000,Sales
4,Diana,35,Boston,90000,Engineering
5,Eve,29,San Francisco,80000,Sales
6,Frank,31,New York,88000,Engineering
7,Grace,26,Boston,70000,Marketing
8,Henry,33,San Francisco,95000,Engineering
EOF

 

 

Read and Analyze

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSVPractical") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Read CSV
df = spark.read.csv(
    "employees.csv",
    header=True,
    inferSchema=True
)

print("=== Schema ===")
df.printSchema()

print("\n=== First 5 rows ===")
df.show(5)

print("\n=== Data types ===")
print(df.dtypes)

print("\n=== Row count ===")
print(f"Total rows: {df.count()}")

print("\n=== Get specific column ===")
df.select("name", "salary").show()

print("\n=== Filter rows ===")
high_earners = df.filter(df.salary > 80000)
high_earners.show()

spark.stop()

 

 

Output:

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department: string (nullable = true)

=== First 5 rows ===
+--+-------+---+-------------+------+-----------+
|id|   name|age|         city|salary| department|
+--+-------+---+-------------+------+-----------+
| 1|  Alice| 28|     New York| 75000|Engineering|
| 2|    Bob| 32|San Francisco| 85000|Engineering|
| 3|Charlie| 25|     New York| 65000|      Sales|
| 4|  Diana| 35|       Boston| 90000|Engineering|
| 5|    Eve| 29|San Francisco| 80000|      Sales|
+--+-------+---+-------------+------+-----------+
only showing top 5 rows

 

Reading from Different Sources

From HDFS

df = spark .read .csv ( " hdfs ://namenode:9000/data/employees.csv" , header = True , inferSchema = True )

From Azure

df = spark .read .csv ( " abfss ://employees.csv" , header = True , inferSchema = True )

From URL

Spark has no built-in HTTP(S) file system, so passing an https:// URL straight to spark.read.csv fails because Hadoop has no filesystem implementation for the https scheme. Download the file to a location Spark can read first, for example with SparkContext.addFile, then read the local copy.
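A minimal working sketch using SparkContext.addFile and SparkFiles to fetch the file first (the URL is a placeholder):

from pyspark import SparkFiles

# Download the file to the Spark nodes, then read the local copy
spark.sparkContext.addFile("https://example.com/data/employees.csv")
df = spark.read.csv(
    "file://" + SparkFiles.get("employees.csv"),
    header=True,
    inferSchema=True
)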

From Directory with Multiple Files

# Read all CSV files in a directory
df = spark.read.csv(
    "data_directory/*.csv",
    header=True,
    inferSchema=True
)
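spark.read.csv also accepts an explicit list of paths when you want specific files rather than a whole directory (the file names below are hypothetical):

# Read a specific list of CSV files (hypothetical file names)
df = spark.read.csv(
    ["data_directory/2024-01.csv", "data_directory/2024-02.csv"],
    header=True,
    inferSchema=True
)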

 

Performance Considerations

Sample Rows for Schema Inference

# By default, inferSchema reads the whole file (samplingRatio = 1.0)
# For large files, lower the sampling ratio to speed up inference
# (at the risk of mis-typed columns if rare values are skipped)
df = spark.read.csv(
    "large_file.csv",
    header=True,
    inferSchema=True,
    samplingRatio=0.1  # Infer types from roughly 10% of the rows
)

Cache after Reading

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.cache()  # Keep in memory for repeated operations

# Trigger cache
df.count()  # Forces evaluation

df.groupBy("department").count().show()  # Uses cached data
df.filter(df.salary > 80000).show()      # Uses cached data
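When the cached DataFrame is no longer needed, releasing the memory is a one-liner:

df.unpersist()  # Release the cached data when you are done with it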

💡 Best Practices

•   Always use header=True if your CSV has headers

•   Use inferSchema=True for initial exploration, then an explicit schema for production (see the sketch after this list)

•   Specify nullValue if your file uses special NULL representations

•   Enable multiLine=True only if needed (it carries a performance cost)

•   Test with small samples first before reading huge files

•   Cache frequently used DataFrames to avoid re-reading
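As a sketch of the "explicit schema for production" practice above, assuming the employees.csv sample from earlier:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: no inference pass, and type mismatches surface immediately
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("department", StringType(), True),
])

df = spark.read.csv("employees.csv", header=True, schema=schema)
df.printSchema()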

 

 

📚 Study Notes

•   Basic read: spark.read.csv(file, header=True)

•   With schema inference: Add inferSchema=True

•   Different delimiters: Use the sep parameter

•   Special characters: Configure quote and escape

•   Multi-line values: Use multiLine=True

•   NULL representation: Use the nullValue parameter

•   Multiple files: Use a wildcard such as *.csv

•   Caching: Call .cache() for repeated use

 

 
