10 Pandas SQL Queries to Use Frequently for Transformation

Pandas makes it easy to search and filter data with DataFrame and Series. It has tools for changing data. Here are the Pandas SQL queries for Dataframes.

Pandas SQL queries
Photo by Carlos Montelara on Pexels.com

Table of contents

  1. Reading CSV file and display all the columns
  2. Get specific columns
  3. Filter rows on a condition
  4. Sort the result set
  5. Get unique values
  6. Group by rows by a specific column
  7. Calculate average for each group
  8. Count number of rows in each group
  9. Pandas to_sql
  10. Conclusion

Reading CSV file and display all the columns

#Pandas: Use the DataFrame directly.
import pandas as pd
df = pd.read_csv('your_data.csv') # Load your data into a DataFrame
display(df)

#Equivalent SQL query:
SELECT column1, column2 FROM table_name

Get specific columns

#Pandas: Specify the column names as a list when indexing the DataFrame.
selected_columns = df[['column1', 'column2']]

#Equivalent SQL query:
SELECT * FROM table_name WHERE condition

Filter rows on a condition

#Pandas: Use boolean indexing to filter rows based on a condition.
filtered_df = df[df['column_name'] > 10] # Example condition

#Equivalent SQL query:
SELECT * FROM table_name ORDER BY column_name ASC/DESC

Sort the result set

#Pandas: Use the sort_values method.
sorted_df = df.sort_values(by='column_name', ascending=False)

#Equivalent SQL query:
SELECT DISTINCT column_name FROM table_name

Get unique values

#Pandas: Use the unique method on a Series.
unique_values = df['column_name'].unique()

#Equivalent SQL query:
GROUP BY column_name

Group by rows by a specific column

#Pandas: Use the groupby method.
grouped = df.groupby('column_name')

#Equivalent SQL query:
SELECT AVG(column_name) FROM table_name GROUP BY column_name

Calculate average for each group

#Pandas: Use the groupby and agg functions.
avg_by_group = df.groupby('column_name')['other_column'].agg('mean')

#Equivalent SQL query:
SELECT uid, result, AVG(value) AS v_1 
FROM data
GROUP BY uid, result;

Count number of rows in each group

# Pandas: Use the groupby and size functions.
count_by_group = df.groupby('column_name').size()

#Equivalent SQL query:
SELECT COUNT(*) FROM table_name GROUP BY column_name

Pandas to_sql

Here’s an example of using pandas to write a DataFrame to a SQL database table.

import pandas as pd
from sqlalchemy import create_engine

# Create a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [28, 32, 45],
'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)

# Create a SQLite database engine
engine = create_engine('sqlite:///sample.db')

# Write the DataFrame to a SQL table
df.to_sql('persons', con=engine, if_exists='replace', index=False)

# Query the newly created table
result = pd.read_sql('SELECT * FROM persons', con=engine)
print(result)

Output

     Name  Age      City
0 John 28 New York
1 Alice 32 Paris
2 Bob 45 London

In this example, we first create a DataFrame df with sample data. Then, we create a SQLite database engine using the create_engine function and specify the database connection URL. Next, we use the to_sql method of the DataFrame to write the data to a table named ‘persons’ in the database specified by the engine. The if_exists parameter is set to ‘replace’ to replace the table if it already exists. Finally, we use pd.read_sql to query the ‘persons’ table and print the result.

Conclusion

Pandas is a powerful and flexible tool in Python for data manipulation and analysis. It provides numerous functionalities beyond the covered SQL operations.

Author: Srini

Experienced Data Engineer, having skills in PySpark, Databricks, Python SQL, AWS, Linux, and Mainframe