Filtering Records Using Conditions
Filtering is one of the most common operations in data processing: it lets you work with only the subset of records that meets specific criteria. Mastering filter conditions helps you solve real-world problems such as finding high earners, active users, or recent transactions, and combining multiple conditions with AND, OR, and NOT logic is a core skill for working with data.
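As a preview of where this section is headed, here is a minimal sketch (assuming a df of employee records; "employment_type" and "years_employed" are illustrative column-name assumptions) that combines all three kinds of logic in one filter:
from pyspark.sql.functions import col
# Hypothetical preview: active employees who are not contractors and who
# either earn over 100000 or have more than 5 years of service
preview = df.filter(
    (col("status") == "active") &
    ~(col("employment_type") == "contractor") &
    ((col("salary") > 100000) | (col("years_employed") > 5))
)
preview.show()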
Basic Filtering
Simple Comparison Operators
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("FilterData").getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
# Greater than
df .filter ( col ( "salary" ) > 80000 ).show ()
# Less than
df .filter ( col ( "age" ) < 30 ).show ()
# Equal to
df .filter ( col ( "department" ) == "Engineering" ).show ()
# Not equal
df .filter ( col ( "status" ) != "inactive" ).show ()
# Greater than or equal
df .filter ( col ( "salary" ) >= 75000 ).show ()
# Less than or equal
df .filter ( col ( "age" ) <= 35 ).show ()
Combining Conditions
AND Operator (&)
# Multiple conditions (all must be true)
df.filter(
    (col("age") > 25) & (col("salary") > 70000)
).show()
# More complex AND
df.filter(
    (col("department") == "Engineering") &
    (col("salary") > 80000) &
    (col("years_employed") > 2)
).show()
OR Operator (|)
# Multiple conditions (any can be true)
df.filter(
    (col("department") == "Engineering") |
    (col("department") == "Sales")
).show()
# Salary is high OR years_employed > 5
df.filter(
    (col("salary") > 100000) |
    (col("years_employed") > 5)
).show()
NOT Operator (~)
# Negate a condition
df.filter(~(col("department") == "Sales")).show()
# Equivalent to:
df.filter(col("department") != "Sales").show()
# NOT with complex condition
df.filter(
    ~((col("age") < 25) & (col("salary") < 60000))
).show()
String Matching
LIKE (Pattern Matching)
# Starts with
df .filter ( col ( "name" ).like ( "A%" ) ).show () # Names starting with A
# Ends with
df .filter ( col ( "email" ).like ( "%.com" ) ).show () # Email ending with .com
# Contains
df .filter ( col ( "email" ).like ( " % g mail %" ) ).show () # Email containing gmail
# Pattern matching
df .filter ( col ( "phone" ).like ( "555-____" ) ).show () # Phone pattern
isin() – Multiple Values
# Match any value in list
df .filter ( col ( "department" ).isin ( "Engineering" , "Sales" , "Management" ) ).show ()
# Using list
departments = [ "Engineering" , "Sales" , "Management" ]
df .filter ( col ( "department" ).isin (* departments ) ).show ()
# Salary is one of specific values
df .filter ( col ( "salary" ).isin ( 75000 , 80000 , 85000 ) ).show ()
NULL Handling
Check for NULL
# Is NULL
df .filter ( col ( " manager_id " ).isNull () ).show () # No manager assigned
# Is NOT NULL
df .filter ( col ( "phone" ).isNotNull () ).show () # Has phone number
# Multiple NULL checks
df .filter (
( col ( "email" ).isNotNull ()) &
( col ( "phone" ).isNotNull ())
).show ()
Complex Filter Conditions
Nested AND/OR
# (A AND B) OR (C AND D)
df.filter(
    (
        (col("age") > 30) & (col("salary") > 80000)
    ) | (
        (col("years_employed") > 5) & (col("department") == "Engineering")
    )
).show()
Practical Example: Complex Filtering
# Create sample employee data
cat > employees.csv << 'EOF'
id,name,age,salary,department,years,status
1,Alice,28,75000,Engineering,3,active
2,Bob,32,85000,Engineering,5,active
3,Charlie,25,65000,Sales,2,inactive
4,Diana,35,90000,Engineering,7,active
5,Eve,29,80000,Sales,4,active
6,Frank,31,88000,Engineering,6,active
7,Grace,26,70000,Marketing,1,active
8,Henry,33,95000,Engineering,8,active
EOF
# Find: Active Engineering employees with salary > 80k
result = df.filter(
    (col("department") == "Engineering") &
    (col("salary") > 80000) &
    (col("status") == "active")
)
print(f"Found {result.count()} employees")
result.show()
# Result:
# +---+-----+---+------+-----------+-----+------+
# | id| name|age|salary| department|years|status|
# +---+-----+---+------+-----------+-----+------+
# |  2|  Bob| 32| 85000|Engineering|    5|active|
# |  4|Diana| 35| 90000|Engineering|    7|active|
# |  6|Frank| 31| 88000|Engineering|    6|active|
# |  8|Henry| 33| 95000|Engineering|    8|active|
# +---+-----+---+------+-----------+-----+------+
String Operations with Filters
Case-Insensitive Matching
from pyspark.sql.functions import lower
# Case-insensitive search
df.filter(lower(col("name")).like("alice%")).show()
# Department matching (case-insensitive)
df.filter(lower(col("department")) == "engineering").show()
String Length Filtering
from pyspark.sql.functions import length
# Names longer than 5 characters
df.filter(length(col("name")) > 5).show()
# Emails with at least 15 characters
df.filter(length(col("email")) >= 15).show()
Numeric Filtering
Between Range
# between() is a Column method, so no extra import is needed
# Salary between 70000 and 90000 (bounds inclusive)
df.filter(col("salary").between(70000, 90000)).show()
# Age between 25 and 35
df.filter(col("age").between(25, 35)).show()
Modulo/Remainder
# ID is even
df .filter (( col ( "id" ) % 2 ) == 0 ).show ()
# ID is odd
df .filter (( col ( "id" ) % 2 ) != 0 ).show ()
Performance Considerations
Filter Early and Often
# GOOD: Filter early
df_active = df.filter(col("status") == "active")
result = df_active.filter(col("salary") > 80000).select("name", "salary")
# BAD: Select before filtering
result = df.select("name", "salary").filter(col("salary") > 80000)
# Rows that will later be discarded still flow through the select step
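To see what the optimizer actually does with a given ordering, a quick sketch is to compare the physical plans with DataFrame.explain(); in many cases Catalyst pushes the filter down, so both orderings end up with similar plans.
# Compare the physical plans of the two orderings
df.filter(col("salary") > 80000).select("name", "salary").explain()
df.select("name", "salary").filter(col("salary") > 80000).explain()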
Combine Filters Efficiently
# GOOD: One filter with multiple conditions
df.filter(
    (col("age") > 25) &
    (col("salary") > 70000) &
    (col("status") == "active")
).show()
# LESS EFFICIENT: Multiple filter calls (wrapped in parentheses so the chain is valid Python)
(
    df.filter(col("age") > 25)
      .filter(col("salary") > 70000)
      .filter(col("status") == "active")
      .show()
)
Common Filtering Patterns
Pattern 1: Find Duplicates
# groupBy().count() adds a "count" column, so no extra imports are needed
# Find names that appear more than once
name_counts = df.groupBy("name").count()
duplicates = name_counts.filter(col("count") > 1)
Pattern 2: Top N Percent
# Top 20% by salary
# approxQuantile returns a list, so take the first element (0.05 is the relative error)
percentile_80 = df.approxQuantile("salary", [0.8], 0.05)[0]
df_top = df.filter(col("salary") >= percentile_80)
Pattern 3: Outliers
from pyspark.sql.functions import mean, stddev
# Find salaries more than 2 standard deviations from the mean
stats = df.agg(mean("salary").alias("avg"), stddev("salary").alias("std")).collect()[0]
mean_val = stats["avg"]
stddev_val = stats["std"]
outliers = df.filter(
    (col("salary") > mean_val + (2 * stddev_val)) |
    (col("salary") < mean_val - (2 * stddev_val))
)
💡 Best Practices
• Use column objects (col()) rather than plain strings for complex filters
• Combine conditions in a single filter() call when possible
• Filter early in your pipeline
• Use the right operators: & for AND, | for OR, ~ for NOT
• Parenthesize clearly: make the logic obvious
• Test filters on smaller datasets first
• Use isNull()/isNotNull() for NULL checks, not == (see the sketch after this list)
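A short sketch of that last point: comparing a column to None with == follows SQL NULL semantics, so the condition is never true and matching rows are silently dropped.
# Wrong: NULL == NULL evaluates to NULL, so this returns no rows
df.filter(col("manager_id") == None).show()
# Right: use the dedicated NULL check
df.filter(col("manager_id").isNull()).show()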
📚 Study Notes
• filter(): Apply WHERE-style conditions to a DataFrame
• Comparison: >, <, ==, !=, >=, <=
• Logical: & (AND), | (OR), ~ (NOT)
• String: like(), isin()
• NULL: isNull(), isNotNull()
• Functions: between(), length(), lower()
• Performance: Filter early and combine conditions