1. Introduction
In this tutorial, we’ll learn how to filter a Pandas dataframe (an instance of pandas.DataFrame), similar to filtering a table using the WHERE clause in SQL.
We’ll learn three methods of doing so:
- Boolean indexing
- the query() method
- SQL interfaces
They’re important to learn because Pandas is used in many tasks ranging from data science to training ML models.
We’ll focus on the case where the condition involves a single column, and the approaches can be adapted to work with multiple columns as well. We tested the code using Pandas 2.2.1, but it should work on the other modern versions of Pandas, too.
2. Data
Here’s the data frame we’ll use in our examples:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 1, 2],
    'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC'],
    'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                         '2024-11-08', '2023-09-02', '2025-02-05', '2025-01-07'])
})
It has an integer column a, a string column b, and a DateTime column c.
3. Boolean Indexing
We can create a Boolean mask by evaluating relational expressions involving columns.
The general syntax is:
df[col] op value
# or
value op df[col]
where:
- df is a dataframe
- col is the column’s name as a string
- op is a relational (comparison) operator, such as &lt; or ==
- value is a value of the same type as the elements of the column df[col]
For each element x of df[col], Pandas internally evaluates x op value and returns a Boolean mask of True and False. We can use it as a logical index to retrieve only the rows where the mask is True:
df[df[col] op value]
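For instance, evaluating such an expression on its own shows the mask itself. Here’s a minimal sketch, rebuilding only column a from the dataframe in Section 2:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

# The comparison alone produces a Boolean Series, one value per row
mask = df['a'] < 3
print(mask.tolist())  # → [True, True, False, False, False, True, True]
```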
Let’s check a few examples.
3.1. Numerical Columns
We can use all the relational operators in Python to compare a column to a value: <, >, <=, >=, ==, and !=. For example:
df[df['a'] < 3]
returns
a b c
0 1 VAA 2025-01-01
1 2 WBB 2025-03-01
5 1 SBC 2025-02-05
6 2 TCC 2025-01-07
If we want to get the rows where a condition isn’t fulfilled, we use ~ to flip the Boolean mask (convert True to False and False to True):
df[~(df['a'] < 3)]
We get:
a b c
2 3 XCC 2024-08-04
3 4 YDD 2024-11-08
4 5 ZAA 2023-09-02
We must put the original mask in parentheses to get correct results.
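Masks can also be combined element-wise with &amp; (and) and | (or), where each sub-condition again needs its own parentheses. A small sketch, reusing only column a from Section 2:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

# Rows where a < 2 or a > 4; each sub-condition in its own parentheses
result = df[(df['a'] < 2) | (df['a'] > 4)]
print(result['a'].tolist())  # → [1, 5, 1]
```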
If we want to test for equality with any of several values, we should use the isin() method:
df[df['a'].isin([1, 5])]
The result is:
a b c
0 1 VAA 2025-01-01
4 5 ZAA 2023-09-02
5 1 SBC 2025-02-05
The method accepts an iterable of values and checks whether each column element is among them.
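Combined with ~, isin() gives us the equivalent of SQL’s NOT IN. A quick sketch on column a from Section 2:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

# Rows where a is neither 1 nor 5 (SQL: WHERE a NOT IN (1, 5))
result = df[~df['a'].isin([1, 5])]
print(result['a'].tolist())  # → [2, 3, 4, 2]
```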
3.2. String Columns
We can use the same relational operators with string columns and string values.
However, we must reference the str accessor to use string-specific methods such as contains(), lower(), upper(), startswith(), endswith(), split(), and others:
# df['b'] ends with 'C'
df[df['b'].str.endswith('C')]
# df['b'] contains 'DD'
df[df['b'].str.contains('DD')]
# df['b'], transformed to lowercase, contains 'bc'
df[df['b'].str.lower().str.contains('bc')]
Here’s a list of commonly used methods:
- str.contains(s) – checks if column elements contain the substring s
- str.lower() – converts characters to lowercase
- str.upper() – converts characters to uppercase
- str.startswith(s) – checks if column elements start with substring s
- str.endswith(s) – checks if column elements end with substring s
- str.split(delimiter) – splits elements at the given delimiter
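As a shortcut for the lowercase-then-contains chain above, str.contains() also accepts a case parameter for case-insensitive matching. A sketch using column b from Section 2:

```python
import pandas as pd

df = pd.DataFrame({'b': ['VAA', 'WBB', 'XCC', 'YDD', 'ZAA', 'SBC', 'TCC']})

# Case-insensitive match: equivalent to str.lower().str.contains('bc')
result = df[df['b'].str.contains('bc', case=False)]
print(result['b'].tolist())  # → ['SBC']
```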
3.3. DateTime Columns
Similarly, we access DateTime methods and properties using the dt accessor:
# df['c'] year is odd
df[df['c'].dt.year % 2 == 1]
# df['c'] month is September
df[df['c'].dt.month == 9]
# or
df[df['c'].dt.month_name() == 'September']
# df['c'] weekday is Monday, Tuesday, or Friday (dt.weekday counts from Monday = 0)
df[df['c'].dt.weekday.isin([0, 1, 4])]
# or
df[df['c'].dt.day_name().isin(['Monday', 'Tuesday', 'Friday'])]
Here’s a list of commonly used constructs:
- dt.year, dt.month, dt.day – extract the year, month, and day of each date
- dt.weekday – returns the weekdays: 0 for Monday, 1 for Tuesday, …
- dt.month_name() – returns the month names
- dt.day_name() – returns the day names
- dt.isocalendar() – returns a dataframe with the ISO year, week, and day columns
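For example, we can filter by the week column that dt.isocalendar() returns. A sketch using column c from Section 2:

```python
import pandas as pd

df = pd.DataFrame({'c': pd.to_datetime(['2025-01-01', '2025-03-01', '2024-08-04',
                                        '2024-11-08', '2023-09-02', '2025-02-05',
                                        '2025-01-07'])})

# Rows whose date falls in the first ISO week of its year
result = df[df['c'].dt.isocalendar().week == 1]
print(result['c'].dt.strftime('%Y-%m-%d').tolist())  # → ['2025-01-01']
```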
4. Using the query() Method
The query() method of pandas.DataFrame can shorten the code compared to Boolean indexing. Instead of creating a mask and using it as a logical index, we write conditions as strings referencing column names and values. For example:
df.query('a < 3')
is equivalent to df[df['a'] < 3] but shorter.
If a column name contains spaces, we enclose it with backticks:
# Create a new column
df['a new column'] = df['a'] + 1
Here’s the updated dataframe:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
2 3 XCC 2024-08-04 4
3 4 YDD 2024-11-08 5
4 5 ZAA 2023-09-02 6
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
Now, let’s filter it by the new column, whose name contains spaces:
df.query('`a new column` < 3')
The result is:
a b c a new column
0 1 VAA 2025-01-01 2
5 1 SBC 2025-02-05 2
We can use a Python variable in the query string if we prepend its name with @:
x = 3
df.query('a < @x')
This returns:
a b c a new column
0 1 VAA 2025-01-01 2
1 2 WBB 2025-03-01 3
5 1 SBC 2025-02-05 2
6 2 TCC 2025-01-07 3
We’ll get an UndefinedVariableError if the referenced variable doesn’t exist in the current scope.
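Query strings also support the in and not in keywords, mirroring the isin() method from Section 3. A sketch on column a from Section 2:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 1, 2]})

values = [1, 5]
# 'in' inside a query string is the equivalent of df[df['a'].isin(values)]
result = df.query('a in @values')
print(result['a'].tolist())  # → [1, 5, 1]
```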
5. SQL-Like Queries
Several SQL interfaces allow us to query a Pandas dataframe like an SQL table. We’ll show how to use DuckDB.
We can install it using pip:
pip install duckdb
Then, we treat the dataframe variable as a table name whose columns we can use in WHERE and SELECT clauses. We pass the query string to the query() function of the duckdb module:
import duckdb
result = duckdb.query('SELECT * FROM df WHERE a < 3')
The result variable is an instance of DuckDB’s relation class, whose string form is a formatted table:
┌───────┬─────────┬─────────────────────┬──────────────┐
│ a │ b │ c │ a new column │
│ int64 │ varchar │ timestamp_ns │ int64 │
├───────┼─────────┼─────────────────────┼──────────────┤
│ 1 │ VAA │ 2025-01-01 00:00:00 │ 2 │
│ 2 │ WBB │ 2025-03-01 00:00:00 │ 3 │
│ 1 │ SBC │ 2025-02-05 00:00:00 │ 2 │
│ 2 │ TCC │ 2025-01-07 00:00:00 │ 3 │
└───────┴─────────┴─────────────────────┴──────────────┘
We can call the df() method to convert it to a pandas.DataFrame:
df_result = duckdb.query('SELECT * FROM df WHERE a < 3').df()
6. Comparison
Which approach is the fastest? It depends on the dataframe size, the query, and the environment. In our test, we created a dataframe with five integer columns and a million rows of random values:
import numpy as np
import pandas as pd
generator = np.random.default_rng(19)
random_data = generator.choice(100, size=(1000000, 5))
df = pd.DataFrame(random_data, columns=['a', 'b', 'c', 'd', 'e'])
We used timeit with 30 runs and an automatically determined number of loops per run to compare run times:

| Method | Mean ± std. dev. of 30 runs | Loops in a Run |
|---|---|---|
| duckdb.query('SELECT * FROM df WHERE a < 20') | 722 µs ± 298 µs per loop | 1000 |
| df[df['a'] < 20] | 9.3 ms ± 1.76 ms per loop | 100 |
| df.query('a < 20') | 24.6 ms ± 6.16 ms per loop | 10 |
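A minimal sketch of how one such measurement can be reproduced with the standard timeit module (the exact numbers will differ across machines and runs):

```python
import timeit

import numpy as np
import pandas as pd

# Same dataframe as in the benchmark: five integer columns, a million rows
generator = np.random.default_rng(19)
df = pd.DataFrame(generator.choice(100, size=(1_000_000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Average seconds per run of the Boolean-indexing filter over 10 loops
seconds = timeit.timeit(lambda: df[df['a'] < 20], number=10) / 10
print(f'{seconds * 1000:.2f} ms per loop')
```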
In this particular test, DuckDB was the fastest, and Boolean indexing was faster than the query() method.
7. Conclusion
In this article, we explored how to filter a Pandas dataframe based on a single column.
We presented three methods: Boolean indexing, the query() method of pandas.DataFrame, and the SQL interface DuckDB.
In our benchmark test, DuckDB performed best, but that need not be the case in every particular application.