The in and out of Group By clause in SQL

GROUP BY Clause – The GROUP BY clause is based on simple partitions. A partition of a set divides the set into subsets such that the union of the subsets returns the original set, and the intersection of the subsets is empty. Think of it as cutting up a pizza pie—each piece of pepperonion belongs to one and only one slice of pizza.

The GROUP BY clause takes the result of the FROM and WHERE clauses, then puts the rows into groups defined as having the same values for the columns listed in the GROUP BY clause.

Each group is reduced to a single row in the result table. This result table is called a grouped table, and all operations are now defined on groups rather than on rows.
By convention, the NULLs are treated as one group. The order of the grouping columns in the GROUP BY clause does not matter, but since all or some of the column names have to appear in the SELECT list, you should probably use the same order in both lists for readability.
Please note the SELECT column names might be a subset of the GROUP BY clause column names, but never the other way around. Let us construct a sample table called Villes to explain in detail how this works. The table is declared as:

CREATE TABLE Villes
(state_code CHAR(2) NOT NULL, -- usps codes
city_name CHAR(25) NOT NULL,
PRIMARY KEY (city_name, state_code));

We populate it with the names of cities that end in “-ville” in each state. The first problem is to find a count of the number of such cities by state_code. The immediate naïve query might be:

SELECT state_code, city_name, COUNT(*)
FROM Villes
GROUP BY state_code;

The groups for Tennessee would have the rows (‘TN’, ‘Nashville’) and (‘TN’, ‘Knoxville’). The first position in the result is the grouping column, which has to be constant within the group. The third column in the SELECT clause is the COUNT(*) for the group, which is clearly two. The city_name column is a problem.

Since the table is grouped by states, there can be at most 50 groups, one for each state_code. The COUNT(*) is clearly a single value, and it applies to the group as a whole.

But what possible single value could I output for a city_name in each group? Pick a typical city_name and use it? If all the cities have the same name, use that name; otherwise, output a NULL.

The worst possible choice would be to output both rows with the COUNT(*) of 2, since each row would imply that there are two cities named Nashville and two cities named Knoxville in Tennessee.

Each row represents a single group, so anything in it must be a characteristic of the group, not of a single row in the group.

This is why there is a rule that the SELECT list must be made up only of grouping columns with optional aggregate function expressions.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.