Here are the interview questions asked during a CGI interview. They cover both the theoretical and coding aspects of PySpark.

PySpark Interview Questions
01. RDD vs. Dataframe in PySpark?
RDD (Resilient Distributed Dataset) and Dataframe are two fundamental data structures in PySpark. They differ in their underlying implementations and usage.
- RDD: An RDD is a distributed collection of objects that can be processed in parallel. RDDs are immutable and can hold any type of Python object. The RDD API exposes low-level transformations and actions, allowing fine-grained control over data processing, which makes it suitable for complex, low-level data manipulations where a more flexible and expressive API is needed. However, RDDs lack the optimization techniques applied to Dataframes, resulting in slower performance for certain operations.
- Dataframe: Dataframe is a distributed collection of data organized into named columns, similar to a table in a relational database or a dataframe in Python’s Pandas library. Dataframes are built on top of RDDs and provide a higher-level and more structured API. They support various features such as the ability to infer the schema, perform SQL-like queries, and leverage query optimization techniques like predicate pushdown and column pruning. Dataframes are generally easier to use and provide better performance for most use cases compared to RDDs.
In summary, RDDs are more suitable when you need fine-grained control over data processing and complex transformations, while Dataframes offer a more optimized and user-friendly API for structured data processing, similar to working with SQL tables or Pandas dataframes.
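To make the contrast concrete, here is a minimal sketch (the app name and sample data are hypothetical, not from the interview) that filters the same data with both APIs — positional tuple access on an RDD versus named-column access on a Dataframe, which lets Spark optimize the query plan:
from pyspark.sql import SparkSession
# Create a SparkSession (the underlying SparkContext is used for the RDD API)
spark = SparkSession.builder.appName("RddVsDataframeExample").getOrCreate()
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
# RDD: low-level API, rows are plain tuples accessed by position
rdd = spark.sparkContext.parallelize(data)
adults_rdd = rdd.filter(lambda row: row[1] >= 30)
print(adults_rdd.collect())
# Dataframe: named columns and SQL-like expressions, with an optimized execution plan
df = spark.createDataFrame(data, ["Name", "Age"])
df.filter(df.Age >= 30).show()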
02. Pandas vs. PySpark?
Pandas and PySpark are both widely used libraries for data processing and analysis in Python, but they differ in their underlying architecture and use cases.
Pandas:
- Pandas is a popular open-source library that provides high-performance data manipulation and analysis tools. It is designed for working with structured data that can fit into memory on a single machine. With Pandas, you can easily load, manipulate, and analyze data using various data structures such as Dataframes and Series.
- Pandas is particularly well-suited for data cleaning, exploration, and performing complex transformations on small to medium-sized datasets. It offers a rich set of functions and methods for tasks like filtering, sorting, aggregating, joining, and reshaping data.
- Pandas leverages a single-machine architecture, which means it operates on data that can fit into the memory of a single machine. This design choice provides fast and efficient data processing on a local machine, making it a go-to choice for many data analysts and data scientists working with smaller datasets.
PySpark:
- PySpark, on the other hand, is the Python API for Apache Spark, a fast and general-purpose cluster computing system. Spark allows you to process and analyze large-scale data sets in a distributed and parallel manner across a cluster of machines.
- PySpark provides a distributed data processing framework based on RDDs (Resilient Distributed Datasets) and Dataframes. It offers a powerful and scalable platform for big data processing tasks, such as large-scale data manipulations, machine learning, and real-time analytics.
- Spark’s parallel and distributed nature enables it to handle and process data that is too large to fit into the memory of a single machine. By distributing the workload across multiple machines, Spark can provide faster and more efficient processing of big data compared to traditional single-machine solutions like Pandas.
- However, working with PySpark typically involves a steeper learning curve compared to Pandas due to its distributed nature and the need to understand concepts like RDDs, transformations, and actions.
In summary, Pandas is well-suited for working with small to medium-sized datasets on a single machine, offering a rich library of data manipulation and analysis functions. PySpark, on the other hand, is designed for distributed big data processing, providing scalability and performance advantages for handling large-scale datasets across a cluster of machines. The choice between Pandas and PySpark depends on the size and complexity of your data and the scalability requirements of your analysis task.
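As a rough illustration (not part of the original interview answer), the same filter looks very similar in both libraries; the difference is where the work runs — in local memory with Pandas, across a cluster with PySpark:
import pandas as pd
from pyspark.sql import SparkSession
# Pandas: the whole dataframe lives in the memory of a single machine
pdf = pd.DataFrame({"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]})
print(pdf[pdf["Age"] >= 30])
# PySpark: the same logic, expressed against a distributed dataframe
spark = SparkSession.builder.appName("PandasVsPySparkExample").getOrCreate()
sdf = spark.createDataFrame(pdf)  # a Pandas dataframe can seed a Spark dataframe
sdf.filter(sdf.Age >= 30).show()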
03. How to remove duplicates in PySpark?
To remove duplicates in PySpark, you can use the dropDuplicates method. Here’s an example code snippet:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("RemoveDuplicatesExample") \
    .getOrCreate()
# Create a dataframe
data = [("John", 25),
        ("Alice", 30),
        ("John", 25),
        ("Alice", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
# Remove duplicates
distinct_df = df.dropDuplicates()
# Show the result
distinct_df.show()
Output
+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Alice| 30|
|Alice| 35|
+-----+---+
In the code above, we first import the necessary modules and create a SparkSession. Then we create a dataframe df with some sample data. Finally, we use the dropDuplicates method on the dataframe to remove duplicate rows. The resulting dataframe, distinct_df, will only contain the unique rows.
You can customize the dropDuplicates method by specifying the subset of columns to consider for duplicate removal. For example, if you want to remove duplicates based on the “Name” column, you can use df.dropDuplicates(["Name"]).
Please note that the dropDuplicates method works based on the values of the specified columns, so it’s important to consider the columns relevant to your use case.
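As a quick illustration, applying this to the dataframe created above keeps a single row per “Name”. Note that when several rows share the same “Name”, which of them survives is not guaranteed:
# Keep only one row per Name; Age is ignored when deciding what counts as a duplicate
deduped_by_name = df.dropDuplicates(["Name"])
deduped_by_name.show()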
04. How to merge two dataframes in PySpark?
To merge two dataframes in PySpark, you can use the join method. Here’s an example code snippet:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("MergeDataframesExample") \
    .getOrCreate()
# Create first dataframe
data1 = [("John", 25),
         ("Alice", 30),
         ("Bob", 35)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])
# Create second dataframe
data2 = [("John", "Developer"),
         ("Alice", "Engineer"),
         ("Eve", "Scientist")]
df2 = spark.createDataFrame(data2, ["Name", "Profession"])
# Merge dataframes
merged_df = df1.join(df2, on="Name", how="inner")
# Show the result
merged_df.show()
Output
+-----+---+----------+
| Name|Age|Profession|
+-----+---+----------+
|Alice| 30| Engineer|
| John| 25| Developer|
+-----+---+----------+
In the code above, we first import the necessary modules and create a SparkSession. Then we create two dataframes, df1 and df2, with some sample data.
To merge the dataframes, we use the join method on df1 and specify the common column to join on using the on parameter (in this case, “Name”). The how parameter is set to “inner” to perform an inner join, which will include only the rows with matching values in both dataframes.
The resulting dataframe, merged_df, will contain the merged data from both dataframes, with columns from both dataframes combined.
You can choose different join types by changing the how parameter. For example, you can use “left”, “right”, or “outer” to perform left join, right join, or full outer join, respectively.
Please note that for the join operation to work correctly, the column you’re joining on should have the same name and data type in both dataframes.
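As a quick sketch, switching the same join to a left join keeps Bob from df1 even though he has no match in df2 (his Profession comes back as null), while Eve from df2 is still dropped:
# Left join: keep every row of df1, filling missing professions with null
left_df = df1.join(df2, on="Name", how="left")
left_df.show()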
05. How to merge two dataframes with different schemas in PySpark?
To merge two dataframes in PySpark with different schemas, you can use the union and select methods. Here’s an example code snippet:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Create a SparkSession
spark = SparkSession.builder \
    .appName("MergeDifferentSchemasExample") \
    .getOrCreate()
# Create first dataframe
data1 = [("John", 25),
         ("Alice", 30),
         ("Bob", 35)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])
# Create second dataframe with a different schema
data2 = [("John", "Developer"),
         ("Alice", "Engineer"),
         ("Eve", "Scientist")]
df2 = spark.createDataFrame(data2, ["Name", "Profession"])
# Add the missing "Profession" column to the first dataframe
df1 = df1.withColumn("Profession", lit(None))
# Add the missing "Age" column and reorder the second dataframe to match the first
df2 = df2.select("Name", lit(None).alias("Age"), "Profession")
# Union the two dataframes
merged_df = df1.union(df2)
# Show the result
merged_df.show()
Output
+-----+----+----------+
| Name| Age|Profession|
+-----+----+----------+
| John| 25| NULL|
|Alice| 30| NULL|
| Bob| 35| NULL|
| John|NULL| Developer|
|Alice|NULL| Engineer|
| Eve|NULL| Scientist|
+-----+----+----------+
In the code above, we first import the necessary modules and create a SparkSession. Then, we create two dataframes, df1 and df2, with different schemas.
To merge the dataframes, we add a missing column to the first dataframe using the withColumn method. We set the new column, “Profession”, to None using the lit function.
Next, we select the necessary columns from the second dataframe to match the schema of the first dataframe. We use the select method and provide the column names to include in the merged dataframe. For the missing columns, we use None values and alias them appropriately.
Finally, we use the union method to combine the two dataframes vertically. The resulting dataframe, merged_df, will have all the rows from both dataframes with a unified schema.
Please note that when merging dataframes with different schemas, it’s important to ensure that the column names, data types, and order of columns match or are appropriately handled in the desired output.
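If you are on Spark 3.1 or later, the unionByName method with allowMissingColumns=True is an alternative that matches columns by name and fills the missing ones with nulls, avoiding the manual padding shown above. Here is a short sketch, reusing the SparkSession from the snippet above with hypothetical sample data:
# unionByName (Spark 3.1+) aligns columns by name and pads missing columns with nulls
people = spark.createDataFrame([("John", 25), ("Bob", 35)], ["Name", "Age"])
jobs = spark.createDataFrame([("John", "Developer"), ("Eve", "Scientist")], ["Name", "Profession"])
merged_by_name = people.unionByName(jobs, allowMissingColumns=True)
merged_by_name.show()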