Here is a dump of questions on Python, SQL, and PySpark. You may be asked similar questions in interviews.

Interview questions on Python, PySpark and SQL

Python questions

1) How do we connect to SQL using Python?

A).
import mysql.connector

connection = mysql.connector.connect(
    host="h_name",
    user="u_name",
    password="pwd"
)
cursor = connection.cursor()
cursor.execute("select database();")

2) What are comprehensions and lambda expressions?

A).

Comprehensions:

Comprehensions make it easy to create Lists, Sets, and Dictionaries. The comprehension consists of a single expression followed by at least one for clause and zero or more for or if clauses.

Lambda Expressions

Lambda functions are anonymous, single-expression functions. They can take any number of arguments but return the value of a single expression. Example syntax: lambda arguments: expression
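A quick illustrative sketch of both (the names below are made up):

# List comprehension: squares of the even numbers 0-9
squares = [n * n for n in range(10) if n % 2 == 0]      # [0, 4, 16, 36, 64]

# Dictionary comprehension: word -> length
lengths = {w: len(w) for w in ["spark", "sql", "python"]}

# Lambda expression used as a sort key
names = sorted(["Phani", "Virat", "Al"], key=lambda s: len(s))   # ['Al', 'Phani', 'Virat']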

3) What are decorators?

A)

A decorator is a function that takes another function as an argument, adds some functionality to it, and returns the wrapped function. We apply a decorator with the "@" syntax.
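A minimal sketch (log_call is a made-up decorator name):

def log_call(func):
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_call
def add(a, b):
    return a + b

print(add(2, 3))    # prints "Calling add", then 5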

4) What is overloading?

A)

In simple terms, overloading means a single operation showing multiple behaviors. There are two types: 1) operator overloading and 2) function overloading.

Operator overloading: operators such as +, -, and * behave differently depending on the operand types.
Function overloading: Python does not support function overloading directly. Indirectly, we can emulate it using i) default parameters, ii) variable-length arguments (*args / **kwargs), iii) functools.singledispatch.
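A short sketch of both ideas (area is a made-up helper):

# Operator overloading: "+" behaves differently for different types
print(1 + 2)        # 3
print("a" + "b")    # ab
print([1] + [2])    # [1, 2]

# Emulating function overloading with a default parameter
def area(length, width=None):
    return length * length if width is None else length * width

print(area(4))      # square: 16
print(area(4, 2))   # rectangle: 8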


5) How is memory managed in Python?

A)
Python allocates memory from two areas: i) stack memory (function call frames) and ii) heap memory (objects). CPython's memory manager also uses object pools for small objects and separate memory blocks for larger objects. Reference counting frees most objects as soon as they are no longer used, and the garbage collector recycles and releases memory from unused objects that form reference cycles.
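A small sketch using the standard sys and gc modules to peek at the pieces mentioned above:

import gc
import sys

x = [1, 2, 3]
print(sys.getrefcount(x))   # reference count held for the list object
print(gc.collect())         # run a collection pass; returns the number of unreachable objects found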

6) What are Python modules?

A)

A module is a file of Python code (definitions and statements) that can be imported and reused. Common standard-library modules include math, os, sys, datetime, etc.
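For example:

import math
from datetime import date

print(math.sqrt(16))    # 4.0
print(date.today())     # today's date, from the datetime module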

7) What are local and global variables in Python?

A).
Local variables are local to a function: their scope is the function in which they are declared. Global variables, on the other hand, are declared outside of any function and can be used anywhere in the program (inside a function, the global keyword is needed to modify them).
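A small sketch:

counter = 0          # global variable

def increment():
    global counter   # needed to rebind the global name inside the function
    local_step = 1   # local variable: only visible inside increment()
    counter += local_step

increment()
print(counter)       # 1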

8) What's the difference between .py and .pyc files?

A)

.py ==> Human-readable Python source code, written by developers.

.pyc ==> Compiled bytecode generated by the Python interpreter to speed up module loading.

9) Is Python an interpreted or compiled language?

Python is an interpreted language. (More precisely, CPython first compiles the source to bytecode and then interprets that bytecode.)

10) How do we read Excel or CSV files in Python?

A).
CSV files:

import csv

with open("file path", "r") as f:
    csv_reader = csv.reader(f)
    for c in csv_reader:
        print(c)

Excel files:

from openpyxl import load_workbook

workbook = load_workbook('path')
sheet = workbook.active
for row in sheet.iter_rows(values_only=True):
    print(row)

11) What is slicing?

A)
Slicing extracts a portion of a sequence (string, list, tuple) using the syntax sequence[start:stop:step]. Any of start, stop, or step may be omitted, in which case None is used and the defaults apply (start of the sequence, end of the sequence, and a step of 1).
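For example:

letters = ['a', 'b', 'c', 'd', 'e']
print(letters[1:4])     # ['b', 'c', 'd']
print(letters[:3])      # ['a', 'b', 'c']   (start defaults to 0)
print(letters[::2])     # ['a', 'c', 'e']   (every second element)
print(letters[::-1])    # reversed copy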

12) What is Pickling?

"Pickling" is the process whereby a Python object hierarchy is converted into a byte stream; "unpickling" is the inverse operation.
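A minimal sketch (data.pkl is a made-up file name):

import pickle

data = {"name": "Phani", "scores": [10, 20, 30]}

with open("data.pkl", "wb") as f:       # pickling: object -> byte stream
    pickle.dump(data, f)

with open("data.pkl", "rb") as f:       # unpickling: byte stream -> object
    restored = pickle.load(f)

print(restored)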

Program questions:

1) Write a Python program to print the occurrence of each number in a list.

A).

numbers = [1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 8]
occurrences = {}

for number in numbers:
    if number in occurrences:
        occurrences[number] += 1
    else:
        occurrences[number] = 1

for number, count in occurrences.items():
    print(f"{number},{count}")

2) Write a Python program to print the letters common to all the words in a list.
E.g. I/P: ['Phani', 'Virat', 'Chaimp']
O/P: a, i

Using sets and their intersection, we can achieve this:

inp = ['Phani', 'Virat', 'Chaimp']

res = set(inp[0].lower())

for i in range(1, len(inp)):
    res = res.intersection(set(inp[i].lower()))

out=sorted(list(res))
print(out)

SQL:

1) What is COALESCE?

A).
COALESCE in SQL returns the first non-NULL value from its arguments (typically columns). If all of the arguments are NULL, it returns NULL, so a literal default is often passed as the last argument.
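A small PySpark SQL sketch (the customers data below is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [(1, "999", None), (2, None, "111"), (3, None, None)],
    ["id", "mobile_phone", "home_phone"],
).createOrReplaceTempView("customers")

# First non-null phone per customer; 'N/A' when both are null
spark.sql(
    "SELECT id, COALESCE(mobile_phone, home_phone, 'N/A') AS contact FROM customers"
).show()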

2) What are the constraints in SQL?

A).
PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT (CREATE INDEX is often listed alongside these, although an index is not strictly a constraint).

3) Primary key Vs. Foreign key?

A).
The primary key uniquely identifies each row in a table, while a foreign key references the primary key of another table, linking the two tables.

4) Different types of JOINS in SQL?

A). LEFT, RIGHT, FULL, INNER, ANTI, SEMI, NATURAL

5) What’s the Difference Between Union All and Full Outer Join?

  • Union All: stacks the result sets of two queries vertically (same number and types of columns) and keeps all rows, including duplicates.
  • Full outer join: combines two tables horizontally on a join condition and keeps non-matching rows from both sides (roughly a left join combined with a right join via union).

6) What are window functions, and what is the difference between RANK, DENSE_RANK, and ROW_NUMBER?

A).

Window function:
A window function performs an aggregate-like operation on a set of query rows. However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row.

Rank:
RANK assigns numbers in sequence. Tied values receive the same rank, and the following rank is skipped (e.g. 1, 2, 2, 4).

Dense_rank:
DENSE_RANK also assigns the same rank to ties, but the next rank continues without gaps (e.g. 1, 2, 2, 3).

Row_number:
ROW_NUMBER assigns a unique sequential number to every row, even when values tie.
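To see the difference side by side, here is a small PySpark SQL sketch on made-up salary data (emp_w is a hypothetical view name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("a", 500), ("b", 400), ("c", 400), ("d", 300)],
    ["ename", "sal"],
).createOrReplaceTempView("emp_w")

spark.sql("""
    SELECT ename, sal,
           RANK()       OVER (ORDER BY sal DESC) AS rnk,        -- 1, 2, 2, 4
           DENSE_RANK() OVER (ORDER BY sal DESC) AS dense_rnk,  -- 1, 2, 2, 3
           ROW_NUMBER() OVER (ORDER BY sal DESC) AS row_num     -- 1, 2, 3, 4
    FROM emp_w
""").show()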

7) What’s the order of execution of a SELECT Query?

A)

  • FROM
  • JOIN
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT
  • DISTINCT
  • ORDER BY
  • LIMIT

8) How do we solve Duplication issues in SQL?

i) SELECT DISTINCT
ii) GROUP BY with HAVING (e.g. HAVING COUNT(*) > 1 to find the duplicated values), as sketched below.
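For example, with a made-up orders view (a PySpark SQL sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [(1, "pen"), (1, "pen"), (2, "book")], ["order_id", "item"]
).createOrReplaceTempView("orders")

# i) keep only one copy of each distinct row
spark.sql("SELECT DISTINCT order_id, item FROM orders").show()

# ii) find the values that occur more than once
spark.sql(
    "SELECT order_id, COUNT(*) AS cnt FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
).show()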

Queries:
1)
eno, ename, sal, grp
10, a, 500, grp1
20, b, 300, grp2
30, c, 400, grp1
40, d, 400, grp3
50, e, 200, grp1
60, f, 300, grp3

Print the 3rd-highest salary from that table.
(The answer must return both rows that share the 3rd-highest salary.)

Solution:

WITH cte AS (
    SELECT eno, ename, sal, grp,
           DENSE_RANK() OVER (ORDER BY sal DESC) AS dr
    FROM emp
)
SELECT eno, ename, sal
FROM cte
WHERE dr = 3;

2) empid, grp, DOJ
10, A, Aug 2021
10, B, Oct 2021
11, A, Mar 2022
12, A, Jan 2022
12, C, Nov 2021
13, B, Apr 2021

Print the employees who belong only to group A (expected output: only 11, not 10 or 12).

A).
SELECT empid
FROM employees
GROUP BY empid
HAVING COUNT(DISTINCT grp) = 1 AND MAX(grp) = 'A';

PySpark

01). What is the difference between cache and persist method?

A). Fundamentally, cache() and persist() are both used to store data for reuse. cache() uses the default storage level (MEMORY_ONLY for RDDs), while persist() lets you choose the storage level: DISK_ONLY, MEMORY_AND_DISK, and more.
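A minimal sketch on two small RDDs:

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize(range(10))
rdd2 = sc.parallelize(range(10))

rdd1.cache()                                  # default storage level (MEMORY_ONLY for RDDs)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # explicitly chosen storage level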

02). How do we handle multi-column delimiters using spark?

A).

Using the split function with a regular-expression pattern, then selecting the pieces by index (this sketch assumes we know how many fields to expect, here 3 for illustration):

from pyspark.sql.functions import split

split_col = split(df['col_name'], '[,;]')   # split on ',' or ';'
df_split = df.select(*[split_col.getItem(i).alias(f'col_{i}') for i in range(3)])

03). What is the coalesce transformation? (Does it decrease the number of partitions?)

A). coalesce() reduces the number of partitions of a DataFrame or RDD, without a full shuffle.
df_coalesced = df.coalesce(2) ==> The DataFrame is now limited to 2 partitions.

04). What’s the difference between Map vs. Flatmap?

A). Both map() and flatMap() apply a function to each element. The main difference is that map() produces exactly one output element per input element, while flatMap() flattens its results, so each input element can produce zero, one, or many output elements.
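A quick sketch:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(["hello world", "py spark"])

print(rdd.map(lambda s: s.split(" ")).collect())      # [['hello', 'world'], ['py', 'spark']] -> 2 elements
print(rdd.flatMap(lambda s: s.split(" ")).collect())  # ['hello', 'world', 'py', 'spark']     -> 4 elements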

05). Is it possible to delete and update in a Hive table? Have you ever used it in your project?

A). Not on regular (non-transactional) Hive tables. Row-level UPDATE and DELETE work only on ACID (transactional) tables, and I have not used them in my project.

06). The default join used in Spark?

A). Inner join is the default join

07). What is a Broadcast Variable?

A). A broadcast variable is a programming mechanism that allows a read-only variable to be cached on each machine instead of being sent with tasks
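For example, broadcasting a small lookup table (the country map is made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lookup = sc.broadcast({"IN": "India", "US": "United States"})    # cached once per executor

rdd = sc.parallelize(["IN", "US", "IN"])
print(rdd.map(lambda code: lookup.value[code]).collect())        # ['India', 'United States', 'India']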

08). What is checkpointing?

A). Checkpointing is a fault-tolerance technique: it saves the state of the computation (for example, an RDD or a streaming job's progress) to reliable storage so that processing can be restarted from that point if it fails.
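A minimal RDD sketch (the checkpoint directory path is made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setCheckpointDir("/tmp/spark_checkpoints")

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.checkpoint()     # mark the RDD for checkpointing
rdd.count()          # an action triggers the actual checkpoint write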

09). How to change column type in RDD, DF & Spark SQL?

A).

RDD

rdd_int = rdd.map(lambda x: int(x))

DF

from pyspark.sql.functions import col

df = df.withColumn("age", col("age").cast("int"))

Spark SQL

result_df = spark.sql("SELECT name, CAST(age AS INT) AS age FROM people")

10). Have you handled date functions in Spark, and how do you extract the Year from the given date?

A).
from pyspark.sql.functions import year

df_with_year = df.withColumn("year", year(df["date"]))

11). What is the difference between select, selectExpr, and withColumn on a Spark DataFrame?

A).
select ==> picks the columns we want from the DataFrame.
selectExpr ==> like select, but takes SQL expressions as strings, so we can rename with "AS", compute col * 2, and so on.
withColumn ==> adds a new column (or replaces an existing one) derived from an expression.
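A small sketch (the df below is made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 100), ("b", 200)], ["name", "sal"])

df.select("name", "sal").show()                        # pick existing columns
df.selectExpr("name", "sal * 2 AS double_sal").show()  # SQL expressions with aliases
df.withColumn("bonus", col("sal") * 0.1).show()        # add a new derived column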

12). How do you print the schema, number of columns, column name, and datatype?

A).
schema ==> df.printSchema()

number of columns ==>
num_columns = len(df.columns)
print(num_columns)

column name ==>

column_names = df.columns
print(column_names)

Datatypes ==>
data_types = df.dtypes
print(data_types)

13). How do you create a temp view from a DataFrame, and why do we need it?

df.createOrReplaceTempView("people")
result_df = spark.sql("SELECT name, age FROM people WHERE age > 28")

Registering a temp view lets us run SQL-style queries against the DataFrame.

14). How do you define SparkContext?

A).
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

15). What’s the diff. b/w reduceByKey() & groupByKey()?

A).
reduceByKey() ==> merges the values for each key into a single value, e.g. [("a", 4), ("b", 5)]
groupByKey() ==> groups the values for each key into a collection (e.g. a list), e.g. [("a", [1, 3]), ("b", [2, 3])]

#reduceByKey()
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 3)])
result = rdd.reduceByKey(lambda x, y: x + y).collect()
print(result)
# Output: [("a", 4), ("b", 5)]

#groupByKey()
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 3)])
result = rdd.groupByKey().mapValues(list).collect()
print(result)
# Output: [("a", [1, 3]), ("b", [2, 3])]

16). How can you define SparkConf?

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("My Spark Application") \
    .setMaster("local[*]")
sc = SparkContext(conf=conf)   # the conf is then passed to the SparkContext

17). How would you control the number of partitions in DF?

A).
On a DataFrame, you can set the number of partitions explicitly with repartition() (or reduce it with coalesce()).
df_repartitioned = df.repartition(10)

18). How can you create an RDD for a text file?

A).
rdd = sc.textFile("path to text file")

19). How will you submit a spark job?

A).
spark-submit \
--master local[*] \
/mypython.py