Top 8 Python Data Engineering Interview Questions with SQL and Pandas Solutions

Here are the SQL and Python(Pandas) interview questions recently asked in an interview.

SQL interview Questions

1️⃣ Top 5 Customers with Highest Total Order Amount

Question:
Find the top 5 customers with the highest total order amount. Include the total sales for each customer and rank them using row numbers.

Table Details

Customers Table

Column	Type	Description
CustomerID	INT	Unique identifier for the customer
CustomerName	NVARCHAR	Name of the customer

Orders Table

Column	Type	Description
OrderID	INT	Unique identifier for the order
CustomerID	INT	Foreign key referencing Customers
OrderDate	DATETIME	Date and time when order was placed

OrderDetails Table

Column	Type	Description
OrderDetailID	INT	Unique identifier for order detail
OrderID	INT	Foreign key referencing Orders
ProductID	INT	Foreign key referencing Products
Quantity	INT	Quantity of product ordered
UnitPrice	DECIMAL	Unit price of product

Products Table

Column	Type	Description
ProductID	INT	Unique identifier for the product
ProductName	NVARCHAR	Name of the product
UnitPrice	DECIMAL	Unit price of the product

Solution

WITH customer_totals AS (
    SELECT 
        c.CustomerID,
        c.CustomerName,
        SUM(od.Quantity * od.UnitPrice) AS TotalAmount
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
    JOIN OrderDetails od ON o.OrderID = od.OrderID
    GROUP BY c.CustomerID, c.CustomerName
),
ranked_customers AS (
    SELECT 
        CustomerID,
        CustomerName,
        TotalAmount,
        ROW_NUMBER() OVER (ORDER BY TotalAmount DESC) AS row_num
    FROM customer_totals
)
SELECT *
FROM ranked_customers
WHERE row_num <= 5;

2️⃣ List All Customers with Total Orders and Average Order Amount

Question:
List all customers along with their total number of orders and average order amount. Include customers with zero orders.

WITH order_totals AS (
    SELECT 
        o.OrderID,
        o.CustomerID,
        SUM(od.Quantity * od.UnitPrice) AS OrderTotal
    FROM Orders o
    JOIN OrderDetails od ON o.OrderID = od.OrderID
    GROUP BY o.OrderID, o.CustomerID
)
SELECT 
    c.CustomerID,
    c.CustomerName,
    COUNT(DISTINCT ot.OrderID) AS TotalOrders,
    AVG(ot.OrderTotal) AS AvgOrderAmount
FROM Customers c
LEFT JOIN order_totals ot ON c.CustomerID = ot.CustomerID
GROUP BY c.CustomerID, c.CustomerName;

3️⃣ Sales by Month for a Specific Product in 2023

Question:
Retrieve the total sales by month for a specific product in the year 2023.

WITH product_sales AS (
    SELECT 
        od.ProductID,
        DATE_FORMAT(o.OrderDate, '%Y-%m') AS Month,
        SUM(od.Quantity * od.UnitPrice) AS TotalSales
    FROM OrderDetails od
    JOIN Orders o ON od.OrderID = o.OrderID
    WHERE od.ProductID = 123  -- Replace with your product ID
      AND YEAR(o.OrderDate) = 2023
    GROUP BY od.ProductID, DATE_FORMAT(o.OrderDate, '%Y-%m')
)
SELECT *
FROM product_sales
ORDER BY Month;

4️⃣ Customers Who Have Not Placed Any Orders in the Last 6 Months

Question:
Retrieve a list of customers who have not placed any orders in the last 6 months.

WITH recent_orders AS (
    SELECT CustomerID
    FROM Orders
    WHERE OrderDate >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
)
SELECT c.CustomerID, c.CustomerName
FROM Customers c
LEFT JOIN recent_orders ro 
ON c.CustomerID = ro.CustomerID
WHERE ro.CustomerID IS NULL;

Pandas interview questions

data = [
    (1, "USA", "A", 200),
    (2, "Canada", "B", 300),
    (3, "India", "C", 50),
    (4, "USA", "D", 5000),
    (5, "India", "A", 800),
    (6, "Canada", "B", 9000),
    (7, "USA", "C", 2500),
    (8, "India", "D", 12000),
    (9, "Canada", "A", 500),
    (10, "USA", "B", 6000)
]

import pandas as pd

df = pd.DataFrame(data, columns=["ID", "Country", "Product", "Sales"])

1️⃣ Filter records where sales ≥ 100

Question:
Filter out the records where the sales amount is less than 100.

Solution:

# Filter sales >= 100
df_filtered = df[df["Sales"] >= 100]
print(df_filtered)

Output:

   ID Country Product  Sales
0   1     USA       A    200
1   2  Canada       B    300
3   4     USA       D   5000
4   5   India       A    800
5   6  Canada       B   9000
6   7     USA       C   2500
7   8   India       D  12000
8   9  Canada       A    500
9  10     USA       B   6000

2️⃣ Group by country and calculate total sales

Question:
Group the sales by country and calculate the total sales amount for each country.

Solution:

df_grouped = df_filtered.groupby("Country", as_index=False)["Sales"].sum()
df_grouped.rename(columns={"Sales": "Total_Sales"}, inplace=True)
print(df_grouped)

Output:

  Country  Total_Sales
0  Canada         9800
1   India        12800
2     USA        13700

3️⃣ Add a new column categorizing total sales

Question:
Add a new column that categorizes total sales into "High", "Medium", and "Low":

"High" if total sales > 10,000
"Medium" if total sales > 5,000 and ≤ 10,000
"Low" if total sales ≤ 5,000

Solution:

def categorize(sales):
    if sales > 10000:
        return "High"
    elif sales > 5000:
        return "Medium"
    else:
        return "Low"

df_grouped["Category"] = df_grouped["Total_Sales"].apply(categorize)
print(df_grouped)

Output:

  Country  Total_Sales Category
0  Canada         9800   Medium
1   India        12800     High
2     USA        13700     High

4️⃣ Optional: Rank countries by total sales

Question:
Rank the countries by total sales using a numeric rank.

Solution:

df_grouped["Rank"] = df_grouped["Total_Sales"].rank(ascending=False, method="first")
print(df_grouped)

Output:

  Country  Total_Sales Category  Rank
0  Canada         9800   Medium   3.0
1   India        12800     High   2.0
2     USA        13700     High   1.0

✅ Notes:

Step 1 filters the dataset to remove small sales.
Step 2 aggregates sales per country.
Step 3 adds a categorical label based on thresholds.
Step 4 shows how to rank items numerically in Pandas.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.