Reading CSV Files into DataFrames

 

Reading CSV files is one of the most common data ingestion tasks in Spark. Mastering the reader's options lets you handle diverse CSV formats, from simple pipe-delimited files to complex CSVs with quoted fields and special characters. Knowing how to configure the CSV reader turns raw text files into clean, typed DataFrames ready for analysis.

 

Basic CSV Reading

Simplest Approach

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Read CSV with headers
df = spark.read.csv("data.csv", header=True)
df.show()

spark.stop()

Result: a DataFrame in which every column has type string (without inferSchema, Spark does not infer types and leaves all columns as strings)
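For example, with the employees.csv sample used later in this article, printSchema() would report every column as string (a sketch, assuming that file):

df = spark.read.csv("employees.csv", header=True)  # no inferSchema
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)
#  |-- city: string (nullable = true)
#  |-- salary: string (nullable = true)
#  |-- department: string (nullable = true)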

Schema Inference

Let Spark Infer Schema

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()

 

# Output:
# root
#  |-- id: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
#  |-- salary: double (nullable = true)

How it works: with inferSchema enabled, Spark makes an extra pass over the data to infer each column's type; by default it reads every row (samplingRatio defaults to 1.0)
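If inference guesses a type you don't want (for example, a numeric-looking ID that should stay a string), you can cast the column after reading. A minimal sketch, assuming a hypothetical id column:

from pyspark.sql.functions import col

# Override an inferred type after reading (hypothetical id column)
df = df.withColumn("id", col("id").cast("string"))
df.printSchema()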

CSV Options

Common Options

df = spark.read.csv(
    "data.csv",
    header=True,          # First row is header
    inferSchema=True,     # Infer column types
    sep=",",              # Column separator (default: comma)
    quote='"',            # Quote character
    escape="\\",          # Escape character (default: backslash)
    multiLine=False,      # Allow values to span multiple lines
    nullValue="",         # String representing NULL
    emptyValue="",        # String representing an empty field
    encoding="utf-8"      # File encoding
)

df.show()

Different Delimiters

# Pipe-delimited
df_pipe = spark.read.csv("data.psv", header=True, sep="|", inferSchema=True)

# Tab-delimited
df_tab = spark.read.csv("data.tsv", header=True, sep="\t", inferSchema=True)

# Semicolon-delimited (European CSV)
df_semi = spark.read.csv("data.csv", header=True, sep=";", inferSchema=True)

# Custom multi-character delimiter (supported in Spark 3.0+)
df_custom = spark.read.csv("data.csv", header=True, sep="||", inferSchema=True)

Handling Special Cases

# File with quotes in values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    quote='"',
    escape='"'
)

# File with NULL represented as "NA"
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    nullValue="NA"
)

# File with multi-line values
df = spark.read.csv(
    "data.csv",
    header=True,
    inferSchema=True,
    multiLine=True,
    quote='"'
)

 

Complete Practical Example

Sample CSV File

cat > employees.csv << 'EOF'
id,name,age,city,salary,department
1,Alice,28,New York,75000,Engineering
2,Bob,32,San Francisco,85000,Engineering
3,Charlie,25,New York,65000,Sales
4,Diana,35,Boston,90000,Engineering
5,Eve,29,San Francisco,80000,Sales
6,Frank,31,New York,88000,Engineering
7,Grace,26,Boston,70000,Marketing
8,Henry,33,San Francisco,95000,Engineering
EOF

 

 

Read and Analyze

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSVPractical") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Read CSV
df = spark.read.csv(
    "employees.csv",
    header=True,
    inferSchema=True
)

print("=== Schema ===")
df.printSchema()

print("\n=== First 5 rows ===")
df.show(5)

print("\n=== Data types ===")
print(df.dtypes)

print("\n=== Row count ===")
print(f"Total rows: {df.count()}")

print("\n=== Get specific column ===")
df.select("name", "salary").show()

print("\n=== Filter rows ===")
high_earners = df.filter(df.salary > 80000)
high_earners.show()

spark.stop()

 

 

Output:

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department: string (nullable = true)

=== First 5 rows ===
+--+-------+---+-------------+------+-----------+
|id|   name|age|         city|salary| department|
+--+-------+---+-------------+------+-----------+
| 1|  Alice| 28|     New York| 75000|Engineering|
| 2|    Bob| 32|San Francisco| 85000|Engineering|
| 3|Charlie| 25|     New York| 65000|      Sales|
| 4|  Diana| 35|       Boston| 90000|Engineering|
| 5|    Eve| 29|San Francisco| 80000|      Sales|
+--+-------+---+-------------+------+-----------+
only showing top 5 rows

 

Reading from Different Sources

From HDFS

df = spark .read .csv ( " hdfs ://namenode:9000/data/employees.csv" , header = True , inferSchema = True )

From Azure

df = spark .read .csv ( " abfss ://employees.csv" , header = True , inferSchema = True )

From URL

Spark has no built-in HTTP(S) file system, so passing an https:// URL straight to spark.read.csv fails because Hadoop has no filesystem implementation for the https scheme. Download the file to a location Spark can read first, for example with SparkContext.addFile, then read the local copy.
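A minimal working sketch using SparkContext.addFile and SparkFiles to fetch the file first (the URL is a placeholder):

from pyspark import SparkFiles

# Download the file to the Spark nodes, then read the local copy
spark.sparkContext.addFile("https://example.com/data/employees.csv")
df = spark.read.csv(
    "file://" + SparkFiles.get("employees.csv"),
    header=True,
    inferSchema=True
)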

From Directory with Multiple Files

# Read all CSV files in a directory
df = spark.read.csv(
    "data_directory/*.csv",
    header=True,
    inferSchema=True
)
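spark.read.csv also accepts an explicit list of paths when you want specific files rather than a whole directory (the file names below are hypothetical):

# Read a specific list of CSV files (hypothetical file names)
df = spark.read.csv(
    ["data_directory/2024-01.csv", "data_directory/2024-02.csv"],
    header=True,
    inferSchema=True
)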

 

Performance Considerations

Sample Rows for Schema Inference

# By default, inferSchema reads the whole file (samplingRatio = 1.0)
# For large files, lower the sampling ratio to speed up inference
# (at the risk of mis-typed columns if rare values are skipped)
df = spark.read.csv(
    "large_file.csv",
    header=True,
    inferSchema=True,
    samplingRatio=0.1  # Infer types from roughly 10% of the rows
)

Cache after Reading

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.cache()  # Keep in memory for repeated operations

# Trigger cache
df.count()  # Forces evaluation

df.groupBy("department").count().show()  # Uses cached data
df.filter(df.salary > 80000).show()      # Uses cached data
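When the cached DataFrame is no longer needed, releasing the memory is a one-liner:

df.unpersist()  # Release the cached data when you are done with it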

💡 Best Practices

•   Always use header=True if your CSV has headers

•   Use inferSchema=True for initial exploration, then an explicit schema for production (see the sketch after this list)

•   Specify nullValue if your file uses special NULL representations

•   Enable multiLine=True only if needed (it carries a performance cost)

•   Test with small samples first before reading huge files

•   Cache frequently used DataFrames to avoid re-reading
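As a sketch of the "explicit schema for production" practice above, assuming the employees.csv sample from earlier:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema: no inference pass, and type mismatches surface immediately
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("department", StringType(), True),
])

df = spark.read.csv("employees.csv", header=True, schema=schema)
df.printSchema()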

 

 

📚 Study Notes

•   Basic read: spark.read.csv(file, header=True)

•   With schema inference: Add inferSchema=True

•   Different delimiters: Use the sep parameter

•   Special characters: Configure quote and escape

•   Multi-line values: Use multiLine=True

•   NULL representation: Use the nullValue parameter

•   Multiple files: Use a wildcard such as *.csv

•   Caching: Call .cache() for repeated use

 

 
