Here are Python interview questions and answers for NumPy, Pandas, and PySpark from CGI. Practicing these will enhance your coding skills.

Python interview questions and answers

01. Using a list comprehension, how do you get only the odd values?

A list comprehension simplifies coding: you can filter a list in a single line, which saves time. Here is my other post on set comprehension.

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
odd_numbers = [x for x in numbers if x % 2 != 0]
print(odd_numbers)

02. Can we read only 5 lines from a file in Python?

Yes. You can slice the output of readlines() (which reads the whole file into memory first), call readline() in a loop until you have the required number of lines, or use itertools.islice, as shown in Method 3 below.

Method 1
--------

with open("/content/sample_data/california_housing_test.csv", "r") as file:
  lines = file.readlines()[:5]

for line in lines:
  print(line)

Method 2
-------
with open("/content/sample_data/california_housing_test.csv", "r") as file:
    lines = []
    for _ in range(5):
        line = file.readline()
        if not line:  # Break if the end of file is reached
            break
        lines.append(line)
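
Method 3
-------
A minimal alternative sketch using itertools.islice from the standard library; it reads lines lazily instead of loading the whole file first.

from itertools import islice

with open("/content/sample_data/california_housing_test.csv", "r") as file:
    # islice stops after the first 5 lines
    lines = list(islice(file, 5))

for line in lines:
    print(line)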

03. Without using Counter, how do you count each item in a list?

def count_duplicates(lst):
    # Initialize an empty dictionary to store the counts
    duplicates = {}
    # Iterate over the list
    for item in lst:
        # Check if the item is already in the dictionary
        if item in duplicates:
            # Increment the count of the item
            duplicates[item] += 1
        else:
            # Add the item to the dictionary with a count of 1
            duplicates[item] = 1
    # Remove items with a count of 1 from the dictionary
    duplicates = {item: count for item, count in duplicates.items() if count > 1}
    # Return the duplicates dictionary
    return duplicates

# Example usage
my_list = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
result = count_duplicates(my_list)
print(result)
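
The function above reports only the items that repeat. If you need the count of every item, a minimal variant using dict.get (still without Counter) looks like this:

def count_items(lst):
    counts = {}
    for item in lst:
        # get() returns 0 when the item has not been seen yet
        counts[item] = counts.get(item, 0) + 1
    return counts

# Example usage
my_list = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(count_items(my_list))  # {1: 1, 2: 2, 3: 3, 4: 4}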

04. How do you compare two lists and write the duplicates to a new list?

Here’s a way:

def get_duplicates(list1, list2):
    duplicates = []
    for item in list1:
        if item in list2 and item not in duplicates:
            duplicates.append(item)
    return duplicates

# Example usage
list1 = [1, 2, 3, 4, 5]
list2 = [4, 5, 6, 7, 8]
result = get_duplicates(list1, list2)
print(result)
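
If the order of the result does not matter, a set intersection is a shorter sketch of the same idea:

list1 = [1, 2, 3, 4, 5]
list2 = [4, 5, 6, 7, 8]

# Keep only the values present in both lists
common = list(set(list1) & set(list2))
print(common)  # [4, 5] (order is not guaranteed)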

05. Python vs. PySpark?

Execution and Scaling

Python programs ordinarily run on a single machine. PySpark is the Python API for Apache Spark, a framework for distributed computing. With plain Python you process data on one machine, whereas PySpark distributes the processing across multiple machines in a cluster and handles larger datasets in parallel. This makes PySpark ideal for scaling up big data processing.

Data Representation

In Python, data is represented with built-in data structures such as lists, dictionaries, and tuples. In PySpark, data is organized into distributed collections, RDDs (Resilient Distributed Datasets) or DataFrames, which are designed for distributed processing and can efficiently handle large-scale data manipulation.
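
As a quick illustration, here is the same small dataset as a plain Python list and as a PySpark DataFrame; a minimal sketch assuming a local SparkSession (the names and ages are just example values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain Python: data lives in built-in structures on one machine
people = [("Asha", 30), ("Ravi", 25)]

# PySpark: the same data as a distributed DataFrame
df = spark.createDataFrame(people, ["name", "age"])
df.show()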

Performance

PySpark generally offers better performance for big data processing compared to Python, especially for computationally intensive tasks. PySpark takes advantage of distributed computing and optimized query execution plans provided by Apache Spark, resulting in faster processing times.

API and Libraries

Python has a wide range of libraries and packages for different tasks. PySpark has its own API that includes many built-in functions and transformations for distributed data processing. It also works well with other Python libraries, allowing you to combine the power of Spark with Python’s functionality.

In short, PySpark performs distributed processing across multiple nodes, while Python runs on a single node.

06. Parallel processing: Python vs. PySpark?

In Python, parallel processing can be achieved using libraries like multiprocessing or concurrent.futures. These provide functionalities for creating and managing multiple processes or threads.
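
For example, a minimal sketch with concurrent.futures (the square function and the numbers are just placeholders):

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    # Run square() on the inputs across a pool of worker processes
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, numbers))
    print(results)  # [1, 4, 9, 16, 25]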

In PySpark, parallel processing is at the core of its design. PySpark leverages Apache Spark’s parallel processing capabilities to distribute and process data across a cluster of machines. PySpark uses a distributed computing engine to perform operations on large datasets in parallel. This can significantly speed up data processing tasks.

To summarize, Python has parallel processing capabilities through libraries, but PySpark is designed specifically for distributed parallel processing on big datasets, leveraging Spark's built-in capabilities.

07. How do we read huge files in Pandas without running out of memory?

We can do it in various ways:

To read a very large file in pandas, you can use the read_csv() function with parameters that optimize memory usage. Here are a few techniques:

Specify dtype: If you know the data types of your columns, you can specify them using the dtype parameter. This allocates the right amount of memory for each column. For example:

dtypes = {'col1': 'int32', 'col2': 'float64', 'col3': 'object'}
df = pd.read_csv('huge_file.csv', dtype=dtypes)

Use chunksize: Reading the file in chunks using the chunksize parameter can help reduce memory usage. This returns the data as an iterable, allowing you to process the data in smaller portions. For example:

chunk_iterator = pd.read_csv('huge_file.csv', chunksize=100000)

for chunk in chunk_iterator:
    # Process each chunk of data here (placeholder)
    print(chunk.shape)

Use usecols: If you only need a subset of columns from the file, you can specify those columns. You can do this using the usecols parameter. This reduces the memory needed to load the data. For example:

usecols = ['col1', 'col2', 'col3']
df = pd.read_csv('huge_file.csv', usecols=usecols)

Specify low_memory: pandas infers data types while reading the file, which can consume more memory. The low_memory option (True by default) makes pandas parse the file in chunks internally, reducing memory usage at the cost of possible mixed-type inference, so it works best together with an explicit dtype. For example:

df = pd.read_csv('huge_file.csv', low_memory=True)
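
These options can also be combined in a single read_csv() call; a sketch using the same hypothetical file and column names as above:

import pandas as pd

dtypes = {'col1': 'int32', 'col2': 'float64'}

# Read only two typed columns, 100,000 rows at a time
for chunk in pd.read_csv('huge_file.csv', usecols=['col1', 'col2'],
                         dtype=dtypes, chunksize=100000):
    print(chunk.shape)  # placeholder: process each chunk here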

Additionally, if your file is in a format other than CSV, you can use similar techniques. Just use the appropriate read_ function provided by pandas. For instance, use read_excel() for Excel files. Use read_parquet() for Parquet files.

08. What is lazy evaluation in PySpark?

In PySpark, a transformation is not executed when we define it; it runs only when an action is called. This concept is called lazy evaluation. In the code below, the filter and map transformations execute only at print(transformed_rdd.collect()), because collect() is an action.

Sample code:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a list of numbers
numbers = [1, 2, 3, 4, 5]

# Create an RDD from the list
rdd = spark.sparkContext.parallelize(numbers)

# Apply transformations on the RDD
transformed_rdd = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# Print the transformed RDD (this is an action)
print(transformed_rdd.collect())

Output:

[4, 8]

Conclusion

Use these Python interview questions and answers to understand how Python is used in data analysis.