Introduction to Pandas

Pandas is an open-source Python library for data manipulation and analysis. It is designed for structured (tabular) data and is built around two flexible, efficient data structures: Series and DataFrame.

The Series is a one-dimensional labeled array capable of holding any data type, while the DataFrame is a two-dimensional labeled data structure that can store columns of different types. This versatility makes Pandas an ideal choice for data analysis tasks ranging from simple operations to complex transformations.

Pandas can read and write data in many file formats, handle missing data, and aggregate data, so it is widely used in data science, machine learning, and statistical analysis. It is built on NumPy and integrates with Matplotlib for plotting, which makes it a cornerstone of the data analysis workflow in Python.

Whether you are analyzing small datasets or working with large volumes of data, Pandas provides the tools needed to clean, transform, and visualize your data effectively.

Pandas Interview Questions for Freshers


1. What is Pandas?

Pandas is a powerful open-source data manipulation and analysis library for Python. It provides data structures such as Series and DataFrame for working efficiently with structured data.

2. How do you create a DataFrame in Pandas?

import pandas as pd

# From a dictionary

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

print(df)

# From a list of lists

df2 = pd.DataFrame([['Alice', 25], ['Bob', 30], ['Charlie', 35]], columns=['Name', 'Age'])

print(df2)

3. What is the difference between Series and DataFrame in Pandas?

A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

# Series example

series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(series)
# DataFrame example

df = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']})
print(df)

4. How do you read a CSV file into a Pandas DataFrame?

df = pd.read_csv('file.csv')  # Replace 'file.csv' with your file path

print(df)
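Real CSV files often need a few extra options. The sketch below is only an illustration: the sample text, the semicolon separator, and the column names are made up, and io.StringIO stands in for a file on disk so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as 'file.csv'.
csv_text = "Name;Age;City\nAlice;25;Paris\nBob;30;Rome\n"

people = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                   # field separator used by this sample
    usecols=["Name", "Age"],   # load only the columns that are needed
    dtype={"Age": "float64"},  # set the column type while reading
)

print(people)
print(people.dtypes)
```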

5. What are the different ways to select columns and rows in a DataFrame?

# Selecting columns

print(df['Name'])
print(df[['Name', 'Age']])

# Selecting rows

print(df.iloc[0])    # First row, selected by integer position
print(df.loc[0])     # Row whose index label is 0
print(df.loc[df['Age'] > 30])  # Rows where Age > 30 (boolean mask)
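With the default integer index, iloc[0] and loc[0] happen to return the same row, which hides the difference between them. A small sketch with string labels (the labels 'x', 'y', 'z' are invented for illustration) makes the distinction visible:

```python
import pandas as pd

# Hypothetical data with string index labels instead of 0, 1, 2.
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["x", "y", "z"],
)

first_by_position = people.iloc[0]   # first row, whatever its label is
row_by_label = people.loc["y"]       # row whose index label is "y"
older_names = people.loc[people["Age"] > 28, "Name"]  # boolean mask plus a column

print(first_by_position["Name"])  # Alice
print(row_by_label["Age"])        # 30
print(older_names.tolist())       # ['Bob', 'Charlie']
```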

Intermediate Pandas Questions

6. How do you handle missing data in Pandas?

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing values with the column mean

df = df.dropna(subset=['Name'])                 # Drop rows where 'Name' is missing
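Before filling or dropping anything, it usually helps to count what is missing. The following is a minimal, self-contained sketch on made-up data; the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age.
people = pd.DataFrame(
    {
        "Name": ["Alice", None, "Charlie", "David"],
        "Age": [25.0, 30.0, np.nan, 35.0],
    }
)

# Count missing values per column.
missing_per_column = people.isna().sum()

# Drop rows without a name, then fill the remaining missing ages with the mean.
cleaned = people.dropna(subset=["Name"])
cleaned = cleaned.assign(Age=cleaned["Age"].fillna(cleaned["Age"].mean()))

print(missing_per_column)
print(cleaned)
```

Assigning the result back, rather than passing inplace=True on a selected column, avoids modifying a temporary copy.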

7. What is the difference between apply(), map(), and applymap() functions?

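apply() calls a function on each value of a Series, or on each column or row of a DataFrame. map() substitutes each value of a Series using a function, dictionary, or Series. applymap() calls a function on every element of a DataFrame; it was renamed to DataFrame.map() in pandas 2.1, and applymap() is deprecated.

# `apply()` example (on a DataFrame column)

df['Age'] = df['Age'].apply(lambda x: x + 1)

# `map()` example (on a Series); values missing from the dictionary become NaN

df['Name'] = df['Name'].map({'Alice': 'Alicia', 'Bob': 'Robert'})

# `applymap()` example (element-wise on a DataFrame); use `DataFrame.map()` in pandas 2.1+

df[['Name', 'Age']] = df[['Name', 'Age']].applymap(str)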
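The three methods are easiest to compare side by side on one small DataFrame. This is a sketch on invented data; the hasattr check is an assumption made here so that the element-wise step runs on both older and newer pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Series.apply: call a function on every value of one column.
age_next_year = frame["Age"].apply(lambda age: age + 1)

# Series.map with a dict: names that are not keys of the dict become NaN.
mapped = frame["Name"].map({"Alice": "Alicia", "Bob": "Robert"})

# Series.replace with the same dict leaves unmatched names unchanged.
replaced = frame["Name"].replace({"Alice": "Alicia", "Bob": "Robert"})

# DataFrame.apply: call a function once per column (the default axis).
age_range = frame[["Age"]].apply(lambda column: column.max() - column.min())

# Element-wise on a whole DataFrame: DataFrame.map in pandas 2.1+, applymap before that.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
as_text = elementwise(str)

print(age_next_year.tolist())  # [26, 31, 36]
print(replaced.tolist())       # ['Alicia', 'Robert', 'Charlie']
print(as_text["Age"].tolist()) # ['25', '30', '35']
```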

8. How can you merge/join DataFrames in Pandas?

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 40]})

# Merge DataFrames on 'ID' column

merged_df = pd.merge(df1, df2, on='ID', how='inner')  # Inner join

print(merged_df)
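The how argument controls which keys survive the merge. As a sketch using the same two tables (renamed here to names and ages for readability), a left join keeps every row of the left table, and an outer join with indicator=True shows where each row came from:

```python
import pandas as pd

names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
ages = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 40]})

# Left join: all IDs from `names`; Age is NaN where `ages` has no match (ID 3).
left_joined = pd.merge(names, ages, on="ID", how="left")

# Outer join: IDs from both tables; the _merge column records the source of each row.
outer_joined = pd.merge(names, ages, on="ID", how="outer", indicator=True)

print(left_joined)
print(outer_joined)
```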

9. How do you group data in Pandas?

grouped = df.groupby('Age').size()  # Count occurrences of each Age

print(grouped)

# Group by column and aggregate

agg_result = df.groupby('Age').agg({'Name': 'count'})

print(agg_result)
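Several aggregations can be computed in one pass with named aggregation, where each keyword becomes an output column. The sales table below is made up for illustration:

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame(
    {
        "Region": ["North", "South", "North", "South", "North"],
        "Amount": [100, 200, 150, 50, 250],
    }
)

# Each keyword argument is (source column, aggregation function).
summary = sales.groupby("Region").agg(
    total=("Amount", "sum"),
    average=("Amount", "mean"),
    orders=("Amount", "count"),
)

print(summary)
```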

10. How can you pivot a DataFrame in Pandas?

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Bob'], 'Year': [2021, 2021, 2022, 2022], 'Value': [10, 20, 15, 25]})

pivot_df = df.pivot(index='Year', columns='Name', values='Value')
print(pivot_df)
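pivot() requires each index/column pair to appear only once and raises a ValueError otherwise. When duplicates are possible, pivot_table() aggregates them instead. A short sketch on invented data, with the mean chosen arbitrarily as the aggregation:

```python
import pandas as pd

# Hypothetical data in which (2021, 'Alice') appears twice.
records = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob", "Bob"],
        "Year": [2021, 2021, 2021, 2022],
        "Value": [10, 30, 20, 25],
    }
)

# pivot_table combines duplicate entries with the given aggregation function.
table = records.pivot_table(index="Year", columns="Name", values="Value", aggfunc="mean")

print(table)
```

Combinations with no data, such as Alice in 2022 here, come out as NaN.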

11. How do you convert a column to a datetime format?

df['Date'] = pd.to_datetime(df['Date'])  # Assumes df has a 'Date' column of date strings
print(df.dtypes)
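Date columns often contain entries that cannot be parsed. A minimal sketch, assuming ISO-style dates and an invented events table: errors='coerce' turns unparseable entries into NaT, and the .dt accessor exposes the date parts afterwards.

```python
import pandas as pd

# Hypothetical data with one invalid entry.
events = pd.DataFrame({"Date": ["2023-01-15", "2023-02-20", "not a date"]})

# Invalid strings become NaT instead of raising an error.
events["Date"] = pd.to_datetime(events["Date"], format="%Y-%m-%d", errors="coerce")

# Extract parts of the date through the .dt accessor.
events["Month"] = events["Date"].dt.month

print(events)
print(events.dtypes)
```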

12. How do you remove duplicates from a DataFrame?

df.drop_duplicates(subset='Name', keep='first', inplace=True)

print(df)
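To inspect duplicates before removing them, duplicated() returns a boolean mask. The sketch below uses made-up data and also shows keep='last', which retains the final occurrence rather than the first:

```python
import pandas as pd

# Hypothetical data in which 'Alice' appears twice.
people = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 26]})

# True marks rows that repeat an earlier 'Name'.
duplicate_mask = people.duplicated(subset="Name")

# Keep the last occurrence of each name instead of the first.
keep_last = people.drop_duplicates(subset="Name", keep="last")

print(duplicate_mask.tolist())  # [False, False, True]
print(keep_last)
```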

13. What is the difference between sort_values() and sort_index()?

# sort_values(): order rows by the values in one or more columns

df_sorted_by_value = df.sort_values(by='Age')

# sort_index(): order rows by their index labels

df_sorted_by_index = df.sort_index()
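The two methods give different orders once the index is not already sorted. A small sketch with an invented, shuffled index:

```python
import pandas as pd

# Hypothetical data whose index labels are out of order.
people = pd.DataFrame(
    {"Name": ["Bob", "Alice", "Charlie"], "Age": [30, 35, 25]},
    index=[2, 0, 1],
)

by_age_desc = people.sort_values(by="Age", ascending=False)  # oldest first
by_index = people.sort_index()                               # index labels 0, 1, 2

print(by_age_desc["Name"].tolist())  # ['Alice', 'Bob', 'Charlie']
print(by_index["Name"].tolist())     # ['Alice', 'Charlie', 'Bob']
```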

14. How can you concatenate DataFrames?

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

# Concatenate DataFrames

concatenated_df = pd.concat([df1, df2], ignore_index=True)

print(concatenated_df)

15. How do you change the data type of a column?

df['Age'] = df['Age'].astype(float)

print(df.dtypes)
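astype() raises an error if any value cannot be converted. For columns that may hold invalid entries, pd.to_numeric with errors='coerce' is a common alternative. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical column of strings with one value that is not a number.
raw = pd.DataFrame({"Age": ["25", "30", "unknown"]})

# raw["Age"].astype(float) would raise a ValueError because of "unknown".
# to_numeric with errors="coerce" converts what it can and uses NaN for the rest.
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

print(raw)
print(raw.dtypes)
```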

Advanced Pandas Questions

16. How do you handle large datasets with Pandas?

# Read the file in chunks of 1000 rows instead of loading it all at once

chunks = pd.read_csv('large_file.csv', chunksize=1000)

for chunk in chunks:
    process(chunk)  # Replace `process` with your processing function
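A typical pattern is to keep only a running result while iterating, so that no more than one chunk is in memory at a time. In this sketch an in-memory string stands in for a large file, and the column name value and the chunk size of 4 are arbitrary choices for illustration:

```python
import io

import pandas as pd

# Stand-in for a large CSV file: one column holding the numbers 1 to 10.
csv_text = "value\n" + "\n".join(str(n) for n in range(1, 11))

total = 0
row_count = 0

# Each chunk is an ordinary DataFrame with at most 4 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    row_count += len(chunk)

print(total)      # 55
print(row_count)  # 10
```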

17. How do you calculate rolling or moving statistics?

df['RollingMean'] = df['Value'].rolling(window=3).mean()

print(df)
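With window=3, the first two results are NaN because a full window is not yet available. The min_periods parameter relaxes that requirement, as this small sketch on an invented Series shows:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Full windows only: the first two entries are NaN.
rolling_mean = values.rolling(window=3).mean()

# Allow partial windows at the start of the Series.
partial_mean = values.rolling(window=3, min_periods=1).mean()

print(rolling_mean.tolist())
print(partial_mean.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```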

18. What is the use of MultiIndex in Pandas?

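A MultiIndex (hierarchical index) gives a Series or DataFrame several index levels, so that data with more than one key can be stored and selected in a two-dimensional table.

# Create a MultiIndex DataFrame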

arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]

index = pd.MultiIndex.from_arrays(arrays, names=('Group', 'Number'))

df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=index)

print(df)
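The benefit of a MultiIndex shows up when selecting and aggregating by level. The sketch below rebuilds the same small DataFrame (named frame here) so that it runs on its own:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 1, 2]], names=("Group", "Number")
)
frame = pd.DataFrame({"Value": [10, 20, 30, 40]}, index=index)

group_a = frame.loc["A"]                     # all rows of group 'A'
single_value = frame.loc[("B", 2), "Value"]  # one cell, addressed by both levels
number_one = frame.xs(1, level="Number")     # cross-section on the inner level
group_totals = frame.groupby(level="Group")["Value"].sum()

print(group_a["Value"].tolist())     # [10, 20]
print(single_value)                  # 40
print(number_one["Value"].tolist())  # [10, 30]
print(group_totals.to_dict())        # {'A': 30, 'B': 70}
```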

19. How do you create a custom function and apply it to a DataFrame?

def custom_function(x):
    return x * 2

df['NewColumn'] = df['Age'].apply(custom_function)

print(df)
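A custom function can also receive a whole row by passing axis=1 to DataFrame.apply, which is useful when the result depends on several columns. The function name describe and the data below are invented for this sketch:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})


def describe(row):
    """Build a short text summary from two columns of one row."""
    return f"{row['Name']} is {row['Age']}"


# axis=1 passes each row to the function as a Series.
people["Summary"] = people.apply(describe, axis=1)

print(people["Summary"].tolist())  # ['Alice is 25', 'Bob is 30']
```

Row-wise apply runs a Python function once per row, so for simple arithmetic a vectorized expression on the columns is usually faster.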

20. How do you perform data visualization with Pandas?

import matplotlib.pyplot as plt

df['Age'].plot(kind='bar')

plt.show()
