DISTINCT Vs. COLLECT_SET: Top Differences

DISTINCT and COLLECT_SET are two vital functions used in Data analysis. Here are the differences and use cases to understand behavior, particularly in SQL-like environments (e.g., SQL, PySpark, etc.).

COLLECT_LIST VS DISTINCT — Photo by Mareefe on Pexels.com

Differences: DISTINCT Vs. COLLECT_SET

1. DISTINCT

Purpose: It filters out duplicate rows or values in a result set.
Output: Returns a result set where all rows or specific columns are unique.
Use case: Use DISTINCT when you want to remove duplicate rows or unique values from a single column.

Example in SQL:

SELECT DISTINCT event_type
FROM event_table;

This query will return all distinct event_type values in the table, each appearing only once in the result.

Example in PySpark:

df.select("event_type").distinct().show()

This returns distinct rows for the event_type column, eliminating duplicates.

Example in Pandas:

df['event_type'].unique()

This returns a NumPy array with the unique values from the event_type column.

2. COLLECT_SET

Purpose: It gathers all unique values within a group, outputting them as an array (set).
Output: Returns an array (or list) of unique values within each group (i.e., groups created via GROUP BY).
Use case: Use COLLECT_SET when you like to gather distinct elements from multiple rows into a single result. This is useful for grouping events or attributes per user.

Example in SQL (Hive/SparkSQL):

SELECT user_id, COLLECT_SET(event_type) AS unique_events
FROM event_table
GROUP BY user_id;

This query will return the unique event_type values for each user_id as an array (or set).

Example in PySpark:

from pyspark.sql import functions as F

df.groupBy("user_id").agg(F.collect_set("event_type").alias("unique_events")).show()

This groups by user_id and returns a set of unique event_type values for each group.

Example in Pandas (equivalent using `apply()` and `set()`):

result = df.groupby('user_id')['event_type'].apply(lambda x: set(x)).reset_index()

This group by user_id and returns the unique event_type values assets.

collect_set and array_sort together usage

DataFrame Creation:

df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
This creates a Spark DataFrame df2 with one column named "age" and three rows: (2,), (5,), (5,). So, df2 looks like:

+---+
|age|
+---+
|  2|
|  5|
|  5|
+---+
Aggregation using collect_set and array_sort:

df2.agg(array_sort(collect_set('age')).alias('c')).collect()

collect_set('age'): This function collects all the distinct values of the "age" column into a set, removing duplicates. In this case, the distinct values are [2, 5].

array_sort: After the set is collected, the array_sort function sorts these distinct values in ascending order. So, the result would be [2, 5].

alias('c'): The result of the aggregation is renamed to "c".

.collect(): This returns the final result as a list of rows. Since we applied an aggregation (agg), the result will be a single row.

Final Output: The final result of the collect operation is:
[Row(c=[2, 5])]

#Explanation
#You are creating a DataFrame df2 with one column "age".
#You are collecting the distinct values of the "age" column, sorting them, #and returning the result as a list in a single row.
#Since the values are [2, 5], the sorted distinct list will be [2, 5]

#For descending order use reverse() function
df2.agg(reverse(array_sort(collect_set('age'))).alias('c')).collect()

Output
[Row(c=[5, 2])]

Key Differences:

DISTINCT:
- Operates on individual columns or rows.
- Returns a flat list of unique values.
- It removes duplicates from the entire result set.
COLLECT_SET:
- Operates within groups of data.
- Returns a set (or array) of unique values for each group.
- Typically used together with GROUP BY to gather unique elements per group.

In summary:

Use DISTINCT when you want unique values across an entire result set.
Use COLLECT_SET when you want unique values within groups of data.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.