DISTINCT and COLLECT_SET are two important functions in data analysis. Here are the differences and use cases that explain their behavior in SQL-like environments (e.g., SQL, PySpark).

Differences: DISTINCT vs. COLLECT_SET
1. DISTINCT
- Purpose: It filters out duplicate rows or values in a result set.
- Output: Returns a result set where all rows or specific columns are unique.
- Use case: Use DISTINCT when you want to remove duplicate rows or get the unique values from a single column.
Example in SQL:
SELECT DISTINCT event_type
FROM event_table;
This query will return all distinct event_type values in the table, each appearing only once in the result.
Example in PySpark:
df.select("event_type").distinct().show()
This returns distinct rows for the event_type column, eliminating duplicates.
Example in Pandas:
df['event_type'].unique()
This returns a NumPy array with the unique values from the event_type column.
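Note that unique() works on a single column, while SQL's DISTINCT can apply to whole rows. In pandas, drop_duplicates() is the closer row-level equivalent. A minimal sketch, using a hypothetical event log (the column names user_id and event_type mirror the examples above):

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event_type": ["click", "click", "view", "click"],
})

# Unique values from one column (order of first appearance)
print(df["event_type"].unique())  # ['click' 'view']

# Distinct rows across several columns (closer to SQL's DISTINCT)
distinct_rows = df.drop_duplicates(subset=["user_id", "event_type"])
print(distinct_rows)  # 3 unique (user_id, event_type) pairs remain
```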
2. COLLECT_SET
- Purpose: It gathers all unique values within a group, outputting them as an array (set).
- Output: Returns an array (or list) of unique values within each group (i.e., groups created via GROUP BY).
- Use case: Use COLLECT_SET when you want to gather distinct elements from multiple rows into a single result. This is useful for grouping events or attributes per user.
Example in SQL (Hive/SparkSQL):
SELECT user_id, COLLECT_SET(event_type) AS unique_events
FROM event_table
GROUP BY user_id;
This query will return the unique event_type values for each user_id as an array (or set).
Example in PySpark:
from pyspark.sql import functions as F
df.groupBy("user_id").agg(F.collect_set("event_type").alias("unique_events")).show()
This groups by user_id and returns a set of unique event_type values for each group.
Example in Pandas (equivalent using apply() and set()):
result = df.groupby('user_id')['event_type'].apply(lambda x: set(x)).reset_index()
This groups by user_id and returns the unique event_type values as sets.
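Put together, the pandas equivalent can be run end to end. A minimal sketch, assuming a small hypothetical event log:

```python
import pandas as pd

# Hypothetical event log for illustration
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_type": ["click", "view", "click", "view", "view"],
})

# Gather the distinct event types per user into a set,
# mirroring COLLECT_SET(event_type) ... GROUP BY user_id
result = df.groupby("user_id")["event_type"].apply(set).reset_index()
print(result)
# user_id 1 -> {'click', 'view'}; user_id 2 -> {'view'}
```

Duplicates within each group (such as the repeated "click" for user 1) are removed, just as COLLECT_SET would do.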
Using collect_set and array_sort together
DataFrame Creation:
df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
This creates a Spark DataFrame df2 with one column named "age" and three rows: (2,), (5,), (5,). So, df2 looks like:
+---+
|age|
+---+
| 2|
| 5|
| 5|
+---+
Aggregation using collect_set and array_sort:
from pyspark.sql.functions import array_sort, collect_set
df2.agg(array_sort(collect_set('age')).alias('c')).collect()
collect_set('age'): This function collects all the distinct values of the "age" column into a set, removing duplicates. In this case, the distinct values are [2, 5].
array_sort: After the set is collected, the array_sort function sorts these distinct values in ascending order. So, the result would be [2, 5].
alias('c'): The result of the aggregation is renamed to "c".
.collect(): This returns the final result as a list of rows. Since we applied an aggregation (agg), the result will be a single row.
Final Output: The final result of the collect operation is:
[Row(c=[2, 5])]
# Explanation
# You are creating a DataFrame df2 with one column "age".
# You are collecting the distinct values of the "age" column, sorting them,
# and returning the result as a list in a single row.
# Since the values are [2, 5], the sorted distinct list will be [2, 5].
# For descending order, use the reverse() function:
from pyspark.sql.functions import reverse
df2.agg(reverse(array_sort(collect_set('age'))).alias('c')).collect()
Output
[Row(c=[5, 2])]
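The collect_set / array_sort / reverse pipeline above can be mimicked in plain Python, which may help make the semantics concrete (the list ages is a stand-in for the "age" column):

```python
# Stand-in for the "age" column of df2
ages = [2, 5, 5]

# collect_set -> set(); array_sort -> sorted()
c = sorted(set(ages))
print(c)  # [2, 5]

# reverse -> sorted(..., reverse=True)
c_desc = sorted(set(ages), reverse=True)
print(c_desc)  # [5, 2]
```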
Key Differences:
- DISTINCT:
- Operates on individual columns or rows.
- Returns a flat list of unique values.
- It removes duplicates from the entire result set.
- COLLECT_SET:
- Operates within groups of data.
- Returns a set (or array) of unique values for each group.
- Typically used together with GROUP BY to gather unique elements per group.
In summary:
- Use DISTINCT when you want unique values across an entire result set.
- Use COLLECT_SET when you want unique values within groups of data.
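The contrast can be shown side by side on the same data. A minimal pandas sketch, using a hypothetical three-row event log:

```python
import pandas as pd

# Hypothetical data to contrast the two operations
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_type": ["click", "view", "click"],
})

# DISTINCT-style: one flat list of unique values over the whole column
distinct_events = sorted(df["event_type"].unique())
print(distinct_events)  # ['click', 'view']

# COLLECT_SET-style: one set of unique values per group
per_user = df.groupby("user_id")["event_type"].apply(set)
print(per_user.to_dict())  # user 1 has both event types, user 2 only 'click'
```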






