Reading JSON Files with Nested Structures
JSON files often contain nested structures, arrays, and objects that require specialized handling. Mastering JSON ingestion in Spark lets you work with modern API responses, hierarchical data, and complex real-world datasets. Knowing how to flatten, explode, and navigate nested JSON turns Spark from a tabular tool into a capable engine for semi-structured data.
JSON vs CSV
| Aspect | CSV | JSON |
|---|---|---|
| Format | Flat, tabular | Nested, hierarchical |
| Nesting | Not supported | Full support |
| Arrays | Not native | Native support |
| Readability | Easy | Complex for nested |
| Performance | Fast | Slower (more parsing) |
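The contrast in the table can be seen in a small stand-alone sketch (plain Python with stdlib modules, illustrative data): the same order record flat, CSV-style, and nested, JSON-style.

```python
import csv
import io
import json

# The same order, flat (CSV) and nested (JSON). CSV must fold the
# customer fields into extra columns; JSON keeps them as a sub-object.
csv_text = "order_id,customer_name,customer_age,total\n1,Alice,28,1050\n"
json_text = '{"order_id": 1, "customer": {"name": "Alice", "age": 28}, "total": 1050}'

flat_row = next(csv.DictReader(io.StringIO(csv_text)))
nested = json.loads(json_text)

# CSV values all arrive as strings; JSON preserves types and hierarchy.
print(flat_row["customer_name"])   # Alice (a plain string column)
print(nested["customer"]["name"])  # Alice (reached through nesting)
print(flat_row["total"])           # '1050' (string, needs casting)
print(nested["total"])             # 1050 (already an int)
```

This is why Spark infers `struct` and `array` types for JSON but only scalar columns for CSV.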
Reading Flat JSON
JSON Lines Format (Most Common)
# Create sample JSON (one JSON object per line)
cat > users.jsonl << 'EOF'
{"id": 1, "name": "Alice", "age": 28, "city": "New York"}
{"id": 2, "name": "Bob", "age": 32, "city": "San Francisco"}
{"id": 3, "name": "Charlie", "age": 25, "city": "Boston"}
EOF
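What makes JSON Lines "most common" for Spark is that each line is a complete, self-contained JSON document, so the file can be parsed line by line (and split across executors). A plain-Python sketch of that property, using the same records as above:

```python
import json

# JSON Lines: each line is one complete JSON document.
# Spark's default JSON reader expects exactly this layout.
jsonl_text = (
    '{"id": 1, "name": "Alice", "age": 28, "city": "New York"}\n'
    '{"id": 2, "name": "Bob", "age": 32, "city": "San Francisco"}\n'
    '{"id": 3, "name": "Charlie", "age": 25, "city": "Boston"}\n'
)

# Parse line by line -- no need to hold the whole file in memory,
# which is what makes the format splittable and Spark-friendly.
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(records))        # 3
print(records[0]["name"])  # Alice
```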
Read with Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSON").getOrCreate()

# Read JSON Lines (the default format)
df = spark.read.json("users.jsonl")

df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- city: string (nullable = true)
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)

df.show()
Reading Nested JSON
Nested Structure Example
cat > orders.jsonl << 'EOF'
{"order_id": 1, "customer": {"name": "Alice", "age": 28}, "items": [{"product": "Laptop", "price": 1000}, {"product": "Mouse", "price": 50}], "total": 1050}
{"order_id": 2, "customer": {"name": "Bob", "age": 32}, "items": [{"product": "Phone", "price": 800}], "total": 800}
EOF
Read and Explore
df = spark.read.json("orders.jsonl")

df.printSchema()
# root
#  |-- customer: struct (nullable = true)
#  |    |-- age: long (nullable = true)
#  |    |-- name: string (nullable = true)
#  |-- items: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- price: long (nullable = true)
#  |    |    |-- product: string (nullable = true)
#  |-- order_id: long (nullable = true)
#  |-- total: long (nullable = true)

df.show(truncate=False)
Accessing Nested Fields
Dot Notation
# Access nested fields with dot notation
df.select("order_id", "customer.name", "customer.age").show()
# Output:
# +--------+-----+---+
# |order_id| name|age|
# +--------+-----+---+
# |       1|Alice| 28|
# |       2|  Bob| 32|
# +--------+-----+---+
Column Method
from pyspark.sql.functions import col

# Alternative syntax using col()
df.select(
    col("order_id"),
    col("customer.name"),
    col("customer.age")
).show()
Working with Arrays
Explode Arrays
from pyspark.sql.functions import explode

# Flatten the array of items: one output row per array element
df_exploded = df.select(
    "order_id",
    explode("items").alias("item")
)

df_exploded.show()
# +--------+--------------+
# |order_id|          item|
# +--------+--------------+
# |       1|[Laptop, 1000]|
# |       1|   [Mouse, 50]|
# |       2|  [Phone, 800]|
# +--------+--------------+

# Access nested fields in the exploded array
df_exploded.select(
    "order_id",
    "item.product",
    "item.price"
).show()
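What `explode()` does can be mirrored in plain Python: one output row per array element, with the non-array columns repeated alongside each element. A stand-alone sketch using the same orders data as above (stdlib only, not Spark itself):

```python
# Plain-Python mirror of explode(): one output row per array element,
# with the non-array columns repeated next to each element.
orders = [
    {"order_id": 1, "items": [{"product": "Laptop", "price": 1000},
                              {"product": "Mouse", "price": 50}]},
    {"order_id": 2, "items": [{"product": "Phone", "price": 800}]},
]

exploded = [
    {"order_id": o["order_id"], "item": item}
    for o in orders
    for item in o["items"]
]

print(len(exploded))                   # 3 rows from 2 orders
print(exploded[1]["item"]["product"])  # Mouse
```

Note that a two-item array contributes two rows, which is exactly why `explode()` can grow a DataFrame substantially.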
Multiline JSON Files
Regular JSON Format (Pretty Printed)
cat > config.json << 'EOF'
{
  "database": {
    "host": "localhost",
    "port": 5432,
    "name": "mydb"
  },
  "cache": {
    "enabled": true,
    "ttl": 3600
  }
}
EOF
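A pretty-printed file like `config.json` is a single JSON document spread over many lines, so the line-at-a-time assumption of JSON Lines breaks on it. That is exactly why the `multiLine` option exists. A small stdlib sketch of the failure mode:

```python
import json

# A pretty-printed document spanning several lines (same shape as config.json)
pretty = """{
  "database": {"host": "localhost", "port": 5432, "name": "mydb"},
  "cache": {"enabled": true, "ttl": 3600}
}"""

# Line-by-line parsing (the JSON Lines assumption) fails on the first line:
try:
    json.loads(pretty.splitlines()[0])  # just "{" -- not a complete document
    line_parse_ok = True
except json.JSONDecodeError:
    line_parse_ok = False

# Parsing the whole text as one document works:
config = json.loads(pretty)
print(line_parse_ok)               # False
print(config["database"]["port"])  # 5432
```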
Read Multiline JSON
# For pretty-printed JSON files (one document spanning multiple lines)
df = spark.read.option("multiLine", "true").json("config.json")
df.show()
Complete Practical Example
Realistic API Response
cat > api_response.jsonl << 'EOF'
{"user_id": 1, "user_name": "Alice", "email": "alice@example.com", "profile": {"age": 28, "location": "New York", "interests": ["Python", "Data Science"]}, "purchases": [{"id": "P001", "product": "Laptop", "price": 1000, "date": "2024-01-15"}, {"id": "P002", "product": "Mouse", "price": 50, "date": "2024-01-20"}]}
{"user_id": 2, "user_name": "Bob", "email": "bob@example.com", "profile": {"age": 32, "location": "San Francisco", "interests": ["Java", "Cloud"]}, "purchases": [{"id": "P003", "product": "Phone", "price": 800, "date": "2024-02-01"}]}
EOF
Parse and Analyze
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, to_date

spark = SparkSession.builder.appName("ParseJSON").getOrCreate()

df = spark.read.json("api_response.jsonl")

print("=== Schema ===")
df.printSchema()

print("\n=== Basic Info ===")
df.select("user_id", "user_name", "email").show()

print("\n=== Nested Profile ===")
df.select(
    "user_id",
    "profile.age",
    "profile.location"
).show()

print("\n=== Exploded Interests ===")
df.select(
    "user_id",
    explode("profile.interests").alias("interest")
).show()

print("\n=== Exploded Purchases ===")
purchases = df.select(
    "user_id",
    "user_name",
    explode("purchases").alias("purchase")
)
purchases.select(
    "user_id",
    "user_name",
    col("purchase.product"),
    col("purchase.price")
).show()

spark.stop()
Common Patterns
Pattern 1: Flatten Nested Structure
from pyspark.sql.functions import col

df_flat = df.select(
    "order_id",
    col("customer.name").alias("customer_name"),
    col("customer.age").alias("customer_age"),
    col("total")
)

df_flat.show()
Pattern 2: Working with Complex Nesting
from pyspark.sql.functions import explode, col

# Multiple levels of nesting. Spark allows only one generator
# (explode) per select clause, so chain the two explodes in
# separate steps instead of putting them in one select.
df_processed = df.select(
    "user_id",
    col("profile.location").alias("location"),
    explode("profile.interests").alias("interest"),
    "purchases"
).select(
    "user_id",
    "location",
    "interest",
    explode("purchases").alias("purchase")
)

df_processed.select(
    "user_id",
    "location",
    "interest",
    col("purchase.product"),
    col("purchase.price")
).show()
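Chaining two `explode()` calls multiplies rows: every interest is paired with every purchase. A plain-Python sketch of the row arithmetic, with data shapes matching the `api_response` example above (purchase IDs stand in for the full purchase structs):

```python
# Row-count arithmetic for chained explodes: each user contributes
# len(interests) * len(purchases) rows to the doubly-exploded result.
users = [
    {"user_id": 1, "interests": ["Python", "Data Science"],
     "purchases": ["P001", "P002"]},
    {"user_id": 2, "interests": ["Java", "Cloud"],
     "purchases": ["P003"]},
]

rows = [
    (u["user_id"], interest, purchase)
    for u in users
    for interest in u["interests"]
    for purchase in u["purchases"]
]

# User 1: 2 interests * 2 purchases = 4 rows; user 2: 2 * 1 = 2 rows.
print(len(rows))  # 6
```

Keep this blow-up in mind on wide arrays; it is often cheaper to explode one array, aggregate, and only then explode the other.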
💡 Key Points
• JSON files can have nested objects (structs) and arrays
• Use dot notation or col() to access nested fields
• Use explode() to flatten arrays
• Set multiLine=true for pretty-printed JSON
• Schema is automatically inferred for JSON
📚 Study Notes
• JSON Lines: one JSON object per line (efficient, splittable)
• Multiline JSON: pretty-printed JSON (a single document per file)
• Nested objects: access with dot notation (customer.name)
• Arrays: use explode() to flatten
• Column access: df["col.nested"] or col("col.nested")
• Complex nesting: chain multiple explode() calls
• Performance: explode() can increase row count significantly