ETL vs ELT
Understanding the difference between ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) is fundamental to modern data architecture decisions. ETL dominated for decades, but ELT has emerged as the preferred approach for cloud data warehouses and big data platforms, offering flexibility, scalability, and cost advantages. This topic explores both paradigms, their trade-offs, and when to use each approach.
ETL: Extract, Transform, Load
Definition and Workflow
Extract (read data from source) → Transform (process in middleware) → Load (write results to target)
How ETL Works
Step 1: Extract - read data from source systems (databases, files, APIs).
Step 2: Transform - cleanse, join, aggregate, and validate the data on a dedicated processing server.
Step 3: Load - write the processed results to the target data warehouse.
ETL Architecture
Data Sources
├─ Database 1
├─ Database 2
├─ Files
└─ APIs
↓
ETL Tool
(Informatica, Talend, Custom Spark)
├─ Extract
├─ Transform (Processing Server)
│   ├─ Cleansing
│   ├─ Joining
│   ├─ Aggregation
│   └─ Validation
└─ Load
↓
Data Warehouse
↓
BI Tools / Reports
Classic ETL Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, concat

spark = SparkSession.builder.appName("ClassicETL").getOrCreate()

# EXTRACT: Read from sources
df_customers = spark.read.jdbc("jdbc:mysql://source:3306/crm", "customers")
df_orders = spark.read.parquet("hdfs://data/orders/")

# TRANSFORM: Process in middleware
# Clean customer names
df_customers = df_customers.withColumn(
    "name_clean",
    upper(concat(col("first_name"), col("last_name")))
)

# Filter valid orders
df_orders = df_orders.filter(col("status") == "completed")

# Join and aggregate
df_processed = (
    df_customers.join(df_orders, "customer_id")
    .groupBy("customer_id", "name_clean")
    .agg({"amount": "sum"})
)

# LOAD: Write to data warehouse
df_processed.write.mode("overwrite") \
    .jdbc("jdbc:postgresql://dw:5432/warehouse", "customer_summary")

spark.stop()
ELT: Extract, Load, Transform
Definition and Workflow
Extract (read data from source) → Load (write raw data to target) → Transform (process in warehouse via SQL)
How ELT Works
Step 1: Extract - read data from source systems.
Step 2: Load - write the raw, untransformed data straight into the warehouse.
Step 3: Transform - process the data inside the warehouse with SQL.
ELT Architecture
Data Sources
├─ Database 1
├─ Database 2
├─ Files
└─ APIs
↓
Load (Minimal)
↓
Cloud Data Warehouse (Snowflake/BigQuery/Redshift)
│
├─ Raw Schema
│   ├─ Raw customer data
│   └─ Raw order data
│
├─ Staging Schema
│   ├─ Cleaned data
│   └─ Deduplicated data
│
└─ Processed Schema
    ├─ Customer summary
    └─ Aggregate tables
↓
BI Tools / Reports
Modern ELT Example
# Extract and Load with minimal processing; Transform happens later in the warehouse
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ModernELT").getOrCreate()

# EXTRACT: Read from sources
df_customers = spark.read.jdbc("jdbc:mysql://source:3306/crm", "customers")
df_orders = spark.read.parquet("hdfs://data/orders/")

# LOAD: Write raw data to the warehouse's raw layer (no transformation)
df_customers.write.mode("overwrite") \
    .parquet("s3://warehouse/raw/customers/")
df_orders.write.mode("overwrite") \
    .parquet("s3://warehouse/raw/orders/")

# TRANSFORM: Happens in Snowflake SQL
snowflake_queries = """
-- Create cleaned customer table
CREATE TABLE processed.customer_summary AS
SELECT
    customer_id,
    UPPER(CONCAT(first_name, ' ', last_name)) AS name,
    email,
    address
FROM raw.customers
WHERE status = 'active';

-- Join and aggregate (the status filter lives in the join condition
-- so the LEFT JOIN still keeps customers with no completed orders)
CREATE TABLE processed.customer_orders AS
SELECT
    c.customer_id,
    c.name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount) AS total_spent,
    AVG(o.amount) AS avg_order_value
FROM processed.customer_summary c
LEFT JOIN raw.orders o
    ON c.customer_id = o.customer_id
    AND o.status = 'completed'
GROUP BY 1, 2;
"""
ETL vs ELT: Detailed Comparison
| Aspect | ETL | ELT |
|---|---|---|
| Processing Location | External tool/server | Data warehouse |
| Data Stored | Processed only | Raw + processed |
| Storage Cost | Lower | Higher (keeps raw) |
| Processing Cost | Higher (external tool) | Lower (warehouse) |
| Time to Load | Longer (transform first) | Shorter (load first) |
| Flexibility | Lower (fixed transforms) | Higher (SQL changes) |
| Data Quality | Enforced in pipeline | Documented in warehouse |
| Historical Data | Limited | Full history available |
| Learning Curve | Complex tools | SQL knowledge |
| Best For | Legacy systems | Cloud data warehouses |
When to Use ETL
ETL is Better When:
• Regulatory or privacy requirements prohibit landing raw data (sensitive fields must be validated and masked first)
• Complex business rules must be enforced before data reaches the warehouse
• You integrate with legacy, on-premise systems
• Data quality must be guaranteed at load time
Example: Banking System
Bank transaction system → ETL transforms (validation, fraud check) → Load to DW
(Cannot load raw transaction data due to regulatory requirements)
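A minimal PySpark sketch of this kind of pre-load transform (the column names, masking rule, and fraud-check threshold here are hypothetical, not from a real banking schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.appName("BankETL").getOrCreate()

df_txns = spark.read.jdbc("jdbc:mysql://core:3306/bank", "transactions")

# Validate and mask BEFORE anything lands in the warehouse
df_clean = (
    df_txns
    .filter(col("amount") > 0)              # basic validation
    .filter(col("amount") < 50000)          # hypothetical fraud-check threshold
    .withColumn("account_hash", sha2(col("account_number"), 256))
    .drop("account_number")                 # raw account number never leaves the pipeline
)

df_clean.write.mode("append") \
    .jdbc("jdbc:postgresql://dw:5432/warehouse", "transactions_clean")

spark.stop()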
When to Use ELT
ELT is Better When:
• The target is a cloud data warehouse (Snowflake, BigQuery, Redshift) with cheap storage
• Data volumes are high and fast loading matters
• Schemas change frequently or requirements are still evolving
• Analysts are comfortable in SQL and want to reshape data themselves
• Full raw history must stay available for reprocessing and audits
Example: E-commerce Analytics
Product catalog → Load raw to S3 → Snowflake transformations (SQL)
(Analysts write SQL to create different product dimensions)
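For instance, a sketch of the kind of SQL an analyst might layer on top of the raw data (the columns and price-tier rule here are illustrative, not from a real catalog schema):

product_dim_sql = """
-- Hypothetical analyst-defined product dimension over the raw layer
CREATE OR REPLACE TABLE processed.dim_product AS
SELECT
    product_id,
    INITCAP(product_name) AS product_name,
    category,
    price,
    CASE WHEN price >= 100 THEN 'premium' ELSE 'standard' END AS price_tier
FROM raw.product_catalog
WHERE is_active = TRUE;
"""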
Hybrid Approach: ETL + ELT
Modern systems often combine both:
Data Sources
├─ Simple sources → ELT (load raw, transform in DW)
└─ Complex sources → ETL (transform before load)
↓
Data Lake / Cloud Storage
↓
Data Warehouse
↓
BI / Analytics
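A minimal sketch of routing sources this way in PySpark (the source registry, the complex_transform helper, and the bucket paths are all hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("HybridPipeline").getOrCreate()

# Hypothetical registry: which sources need pre-load transforms
sources = [
    {"path": "hdfs://data/clickstream/", "needs_etl": False},
    {"path": "hdfs://data/payments/",    "needs_etl": True},
]

def complex_transform(df):
    # Placeholder for the validation/masking rules a complex source requires
    return df.filter(col("amount") > 0)

for src in sources:
    df = spark.read.parquet(src["path"])
    if src["needs_etl"]:
        df = complex_transform(df)          # ETL: transform before load
        target = "s3://warehouse/staging/"  # lands pre-processed
    else:
        target = "s3://warehouse/raw/"      # ELT: land raw, transform in DW
    name = src["path"].rstrip("/").split("/")[-1]
    df.write.mode("overwrite").parquet(target + name + "/")

spark.stop()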
Real-World Decision Matrix
| Scenario | Approach | Reason |
|---|---|---|
| CRM to Salesforce DW | ELT | Cloud-native, simple schema |
| IoT sensors to analytics | ELT | High volume, cheap storage |
| Bank transactions | ETL | Regulatory, privacy |
| Social media to DW | ELT | Flexible, changing schema |
| POS to data mart | ETL | Real-time, complex rules |
Tools and Technologies
ETL Tools
• Informatica PowerCenter
• Talend
• Apache Spark (custom pipelines)
• SSIS (SQL Server Integration Services)
ELT Tools
• Cloud warehouses with in-database SQL transforms: Snowflake, BigQuery, Redshift
• dbt (SQL transformations inside the warehouse)
• Fivetran / Airbyte (extract-and-load connectors)
Best Practices
For ETL:
• Validate and profile data as early as possible in the pipeline
• Keep transformations modular and version-controlled
• Log record counts at each stage so failures are easy to localize
For ELT:
• Keep the raw layer immutable; never transform it in place
• Organize the warehouse into raw → staging → processed schemas
• Express transformations as SQL so analysts can audit and rerun them
💡 Key Insights
• Cloud shifted the balance: ELT now preferred for cloud DWs
• Storage is cheap: Keep raw data for reprocessing
• Flexibility wins: Business rules change frequently
• Audit trails matter: Raw data history valuable
• Hybrid is realistic: Different sources, different approaches
📚 Study Notes
• ETL: Transform first, load processed data
• ELT: Load raw first, transform in warehouse
• ETL pros: Privacy, early validation
• ELT pros: Flexibility, cheaper, full history
• Modern preference: ELT for cloud systems
• Hybrid: Use both for different data sources