Filtering Records Using Conditions
Filtering is one of the most common operations in data processing: it lets you work with only the subset of records that meets specific criteria. Mastering filter conditions helps you solve real-world problems such as finding high earners, active users, or recent transactions, and combining multiple conditions with AND, OR, and NOT logic is a core skill for working with data.
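As a preview of where this section is headed, here is a minimal sketch (assuming a df of employee records; "employment_type" and "years_employed" are illustrative column-name assumptions) that combines all three kinds of logic in one filter:
from pyspark.sql.functions import col
# Hypothetical preview: active employees who are not contractors and who
# either earn over 100000 or have more than 5 years of service
preview = df.filter(
    (col("status") == "active") &
    ~(col("employment_type") == "contractor") &
    ((col("salary") > 100000) | (col("years_employed") > 5))
)
preview.show()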
Basic Filtering
Simple Comparison Operators
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("FilterData").getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
# Greater than
df .filter ( col ( "salary" ) > 80000 ).show ()
# Less than
df .filter ( col ( "age" ) < 30 ).show ()
# Equal to
df .filter ( col ( "department" ) == "Engineering" ).show ()
# Not equal
df .filter ( col ( "status" ) != "inactive" ).show ()
# Greater than or equal
df .filter ( col ( "salary" ) >= 75000 ).show ()
# Less than or equal
df .filter ( col ( "age" ) <= 35 ).show ()
Combining Conditions
AND Operator (&)
# Multiple conditions (all must be true)
df.filter(
    (col("age") > 25) & (col("salary") > 70000)
).show()
# More complex AND
df.filter(
    (col("department") == "Engineering") &
    (col("salary") > 80000) &
    (col("years_employed") > 2)
).show()
OR Operator (|)
# Multiple conditions (any can be true)
df.filter(
    (col("department") == "Engineering") |
    (col("department") == "Sales")
).show()
# Salary is high OR years_employed > 5
df.filter(
    (col("salary") > 100000) |
    (col("years_employed") > 5)
).show()
NOT Operator (~)
# Negate a condition
df.filter(~(col("department") == "Sales")).show()
# Equivalent to:
df.filter(col("department") != "Sales").show()
# NOT with complex condition
df.filter(
    ~((col("age") < 25) & (col("salary") < 60000))
).show()
String Matching
LIKE (Pattern Matching)
# Starts with
df .filter ( col ( "name" ).like ( "A%" ) ).show () # Names starting with A
# Ends with
df .filter ( col ( "email" ).like ( "%.com" ) ).show () # Email ending with .com
# Contains
df .filter ( col ( "email" ).like ( " % g mail %" ) ).show () # Email containing gmail
# Pattern matching
df .filter ( col ( "phone" ).like ( "555-____" ) ).show () # Phone pattern
isin() – Multiple Values
# Match any value in list
df .filter ( col ( "department" ).isin ( "Engineering" , "Sales" , "Management" ) ).show ()
# Using list
departments = [ "Engineering" , "Sales" , "Management" ]
df .filter ( col ( "department" ).isin (* departments ) ).show ()
# Salary is one of specific values
df .filter ( col ( "salary" ).isin ( 75000 , 80000 , 85000 ) ).show ()
NULL Handling
Check for NULL
# Is NULL
df .filter ( col ( " manager_id " ).isNull () ).show () # No manager assigned
# Is NOT NULL
df .filter ( col ( "phone" ).isNotNull () ).show () # Has phone number
# Multiple NULL checks
df .filter (
( col ( "email" ).isNotNull ()) &
( col ( "phone" ).isNotNull ())
).show ()
Complex Filter Conditions
Nested AND/OR
# (A AND B) OR (C AND D)
df.filter(
    (
        (col("age") > 30) & (col("salary") > 80000)
    ) | (
        (col("years_employed") > 5) & (col("department") == "Engineering")
    )
).show()
Practical Example: Complex Filtering
# Create sample employee data
cat > employees.csv << 'EOF'
id,name,age,salary,department,years,status
1,Alice,28,75000,Engineering,3,active
2,Bob,32,85000,Engineering,5,active
3,Charlie,25,65000,Sales,2,inactive
4,Diana,35,90000,Engineering,7,active
5,Eve,29,80000,Sales,4,active
6,Frank,31,88000,Engineering,6,active
7,Grace,26,70000,Marketing,1,active
8,Henry,33,95000,Engineering,8,active
EOF
# Find: Active Engineering employees with salary > 80k
result = df.filter(
    (col("department") == "Engineering") &
    (col("salary") > 80000) &
    (col("status") == "active")
)
print(f"Found {result.count()} employees")
result.show()
# Result:
# +---+-----+---+------+-----------+-----+------+
# | id| name|age|salary| department|years|status|
# +---+-----+---+------+-----------+-----+------+
# |  2|  Bob| 32| 85000|Engineering|    5|active|
# |  4|Diana| 35| 90000|Engineering|    7|active|
# |  6|Frank| 31| 88000|Engineering|    6|active|
# |  8|Henry| 33| 95000|Engineering|    8|active|
# +---+-----+---+------+-----------+-----+------+
String Operations with Filters
Case-Insensitive Matching
from pyspark.sql.functions import lower
# Case-insensitive search
df.filter(lower(col("name")).like("alice%")).show()
# Department matching (case-insensitive)
df.filter(lower(col("department")) == "engineering").show()
String Length Filtering
from pyspark.sql.functions import length
# Names longer than 5 characters
df.filter(length(col("name")) > 5).show()
# Emails with at least 15 characters
df.filter(length(col("email")) >= 15).show()
Numeric Filtering
Between Range
# between() is a Column method, so no extra import is needed
# Salary between 70000 and 90000 (bounds inclusive)
df.filter(col("salary").between(70000, 90000)).show()
# Age between 25 and 35
df.filter(col("age").between(25, 35)).show()
Modulo/Remainder
# ID is even
df .filter (( col ( "id" ) % 2 ) == 0 ).show ()
# ID is odd
df .filter (( col ( "id" ) % 2 ) != 0 ).show ()
Performance Considerations
Filter Early and Often
# GOOD: Filter early
df_active = df.filter(col("status") == "active")
result = df_active.filter(col("salary") > 80000).select("name", "salary")
# BAD: Select before filtering
result = df.select("name", "salary").filter(col("salary") > 80000)
# Rows that will later be discarded still flow through the select step
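To see what the optimizer actually does with a given ordering, a quick sketch is to compare the physical plans with DataFrame.explain(); in many cases Catalyst pushes the filter down, so both orderings end up with similar plans.
# Compare the physical plans of the two orderings
df.filter(col("salary") > 80000).select("name", "salary").explain()
df.select("name", "salary").filter(col("salary") > 80000).explain()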
Combine Filters Efficiently
# GOOD: One filter with multiple conditions
df.filter(
    (col("age") > 25) &
    (col("salary") > 70000) &
    (col("status") == "active")
).show()
# LESS EFFICIENT: Multiple filter calls (wrapped in parentheses so the chain is valid Python)
(
    df.filter(col("age") > 25)
      .filter(col("salary") > 70000)
      .filter(col("status") == "active")
      .show()
)
Common Filtering Patterns
Pattern 1: Find Duplicates
# groupBy().count() adds a "count" column, so no extra imports are needed
# Find names that appear more than once
name_counts = df.groupBy("name").count()
duplicates = name_counts.filter(col("count") > 1)
Pattern 2: Top N Percent
# Top 20% by salary
# approxQuantile returns a list, so take the first element (0.05 is the relative error)
percentile_80 = df.approxQuantile("salary", [0.8], 0.05)[0]
df_top = df.filter(col("salary") >= percentile_80)
Pattern 3: Outliers
from pyspark.sql.functions import mean, stddev
# Find salaries more than 2 standard deviations from the mean
stats = df.agg(mean("salary").alias("avg"), stddev("salary").alias("std")).collect()[0]
mean_val = stats["avg"]
stddev_val = stats["std"]
outliers = df.filter(
    (col("salary") > mean_val + (2 * stddev_val)) |
    (col("salary") < mean_val - (2 * stddev_val))
)
💡 Best Practices
• Use column objects (col()) rather than plain strings for complex filters
• Combine conditions in a single filter() call when possible
• Filter early in your pipeline
• Use the right operators: & for AND, | for OR, ~ for NOT
• Parenthesize clearly: make the logic obvious
• Test filters on smaller datasets first
• Use isNull()/isNotNull() for NULL checks, not == (see the sketch after this list)
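A short sketch of that last point: comparing a column to None with == follows SQL NULL semantics, so the condition is never true and matching rows are silently dropped.
# Wrong: NULL == NULL evaluates to NULL, so this returns no rows
df.filter(col("manager_id") == None).show()
# Right: use the dedicated NULL check
df.filter(col("manager_id").isNull()).show()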
📚 Study Notes
• filter(): Apply WHERE-style conditions to a DataFrame
• Comparison: >, <, ==, !=, >=, <=
• Logical: & (AND), | (OR), ~ (NOT)
• String: like(), isin()
• NULL: isNull(), isNotNull()
• Functions: between(), length(), lower()
• Performance: Filter early and combine conditions