DISTINCT and COLLECT_SET are two vital functions used in Data analysis. Here are the differences and use cases to understand behavior, particularly in SQL-like environments (e.g., SQL, PySpark, etc.).

COLLECT_LIST VS DISTINCT
Photo by Mareefe on Pexels.com

Differences: DISTINCT Vs. COLLECT_SET

1. DISTINCT

Example in SQL:

SELECT DISTINCT event_type
FROM event_table;

This query will return all distinct event_type values in the table, each appearing only once in the result.

Example in PySpark:

df.select("event_type").distinct().show()

This returns distinct rows for the event_type column, eliminating duplicates.

Example in Pandas:

df['event_type'].unique()

This returns a NumPy array with the unique values from the event_type column.

2. COLLECT_SET

Example in SQL (Hive/SparkSQL):

SELECT user_id, COLLECT_SET(event_type) AS unique_events
FROM event_table
GROUP BY user_id;

This query will return the unique event_type values for each user_id as an array (or set).

Example in PySpark:

from pyspark.sql import functions as F

df.groupBy("user_id").agg(F.collect_set("event_type").alias("unique_events")).show()

This groups by user_id and returns a set of unique event_type values for each group.

Example in Pandas (equivalent using apply() and set()):

result = df.groupby('user_id')['event_type'].apply(lambda x: set(x)).reset_index()

This group by user_id and returns the unique event_type values assets.

collect_set and array_sort together usage

DataFrame Creation:

df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
This creates a Spark DataFrame df2 with one column named "age" and three rows: (2,), (5,), (5,). So, df2 looks like:

+---+
|age|
+---+
|  2|
|  5|
|  5|
+---+
Aggregation using collect_set and array_sort:

df2.agg(array_sort(collect_set('age')).alias('c')).collect()

collect_set('age'): This function collects all the distinct values of the "age" column into a set, removing duplicates. In this case, the distinct values are [2, 5].

array_sort: After the set is collected, the array_sort function sorts these distinct values in ascending order. So, the result would be [2, 5].

alias('c'): The result of the aggregation is renamed to "c".

.collect(): This returns the final result as a list of rows. Since we applied an aggregation (agg), the result will be a single row.

Final Output: The final result of the collect operation is:
[Row(c=[2, 5])]

#Explanation
#You are creating a DataFrame df2 with one column "age".
#You are collecting the distinct values of the "age" column, sorting them, #and returning the result as a list in a single row.
#Since the values are [2, 5], the sorted distinct list will be [2, 5]

#For descending order use reverse() function
df2.agg(reverse(array_sort(collect_set('age'))).alias('c')).collect()

Output
[Row(c=[5, 2])]

Key Differences:

  • DISTINCT:
    • Operates on individual columns or rows.
    • Returns a flat list of unique values.
    • It removes duplicates from the entire result set.
  • COLLECT_SET:
    • Operates within groups of data.
    • Returns a set (or array) of unique values for each group.
    • Typically used together with GROUP BY to gather unique elements per group.

In summary:

  • Use DISTINCT when you want unique values across an entire result set.
  • Use COLLECT_SET when you want unique values within groups of data.