26th Jan 2023 9 minutes read How to Group by Multiple Columns in SQL Gustavo du Mortier sql learn sql GROUP BY Table of Contents GROUP BY 1 Column GROUP BY 2 Columns GROUP BY Multiple Columns Other Ways of Using GROUP BY with Multiple Columns Using GROUP BY Multiple Columns: Grouping a Hierarchy Using GROUP BY Multiple Columns: Non-Hierarchical Grouping GROUP BY with Multiple Columns Returns Faceted Information When analyzing large data sets, you often create groupings and apply aggregate functions to find totals or averages. In these cases, using the GROUP BY clause with multiple columns unfolds its full potential. GROUP BY is a clause of the SELECT command. It allows you to compute various statistics for a group of rows. For example, you can use GROUP BY with an employee table to know how many employees are of each gender. Or you can group by multiple columns to determine the average age of vehicles for each make and model in a vehicle_fleet table. In this article, we’ll examine in detail how grouping by multiple columns works. This article assumes you already know how to use GROUP BY in an SQL query. Not familiar with GROUP BY? The best way to learn this and other basic SQL constructions is with our interactive SQL Basics course. It contains 129 hands-on practical exercises. In each exercise, you get a short explanation and task to solve. With each exercise completed, you build confidence in your SQL skills. This course is also a great way to review basic SQL features if your knowledge is a bit rusty. Ok, let’s start with a refresher on a simple use case for GROUP BY. GROUP BY 1 Column Each combination of the values of column(s) specified in the GROUP BY clause constitutes a group; the SELECT command with a GROUP BY clause displays a single row for each group. It’s also good to note that GROUP BY allows you to apply aggregate functions on columns not included in the outstanding subset. Let’s see an example. I have created a table called WorldWideFriends that stores data on my friends in different parts of the world: FriendNameCityStateCountry MaríaAcapulcoGuerreroMéxico FernandoCaracasDistrito CapitalVenezuela GersonMedellínAntioquíaColombia MónicaBogotáCundinamarcaColombia PaulBogotáCundinamarcaColombia KevinLexingtonKentuckyUSA CeciliaGodoy CruzMendozaArgentina PabloAtlántidaCanelonesUruguay AndreaCdad. MendozaMendozaArgentina MarlonSao PauloSao PauloBrasil JoaoRio de JaneiroRio de JaneiroBrasil AndrésBarilocheRío NegroArgentina MarianoMiamiFloridaUSA I would like to use the information in this table to do some research – e.g. to get a list of the countries where my friends live, including the number of friends living in each country. If I wanted to know how many friends I have in each country, I’d use GROUP BY together with the COUNT() aggregate function: SELECT Country, COUNT(*) AS HowMany FROM WorldWideFriends GROUP BY Country; This query gives me a result set that condenses the rows with the same country into only one row, while COUNT(*) tells me how many repeated rows there are for each country: CountryHowMany Argentina3 Venezuela1 Colombia3 Brasil2 USA2 México1 Uruguay1 The above query gives me the information I would need if, for example, I needed to choose which country to travel to in order to meet as many of my friends as possible. If you’d like to read more about the basic usage of GROUP BY, I recommend our articles on What Is GROUP BY in SQL and How to Use GROUP BY. However, even if I travel to a country where many of my friends live, those friends may be located in different states. I may not have time to travel from one state to another to visit them all. So I need to refine my search a bit to find the geographic location where there is a higher concentration of my friends. GROUP BY 2 Columns So now I need to know how my friends are distributed by state as well as country. I can find this out by adding the column State to my previous GROUP BY Country (separating them with commas) and in the SELECT clause. The query looks like this: SELECT Country, State, COUNT(*) AS HowMany FROM WorldWideFriends GROUP BY Country, State; Looking at the results of this query, we can see some of the countries that previously appeared in only one row now appear in several rows. The reason is that when we add the State field, the query must assemble the groups with the rows that have the same value in both Country and State. In the previous query, the row corresponding to ‘Colombia’ had a 3 in the HowMany field. In this case, ‘Colombia’ appears in two rows with different values for State: one for 'Antioquia' and the other for 'Cundinamarca'. In the HowMany field, the row corresponding to 'Antioquia' indicates 1, while the row corresponding to 'Cundinamarca' indicates 2. This means that, in the disaggregated list, there are two rows with Country = 'Colombia' and State = 'Cundinamarca', and only one with Country = 'Colombia' and State = 'Antioquia'. The sum of the HowMany values of these two rows logically matches the previous HowMany value for the row corresponding to 'Colombia'. The same will be true for any of the other countries that are divided into several rows with different states. CountryStateHowMany ArgentinaMendoza2 ArgentinaRío Negro1 VenezuelaDistrito Capital1 ColombiaAntioquía1 ColombiaCundinamarca2 BrasilRio de Janeiro1 BrasilSao Paulo1 USAKentucky1 USAFlorida1 MéxicoGuerrero1 UruguayCanelones1 GROUP BY Multiple Columns Finally, if my intention is to make my trip as short as possible and still visit as many friends as possible, I just need to add the City column to my query – both in the SELECT and in the GROUP BY – to see which cities have the highest number of friends: SELECT Country, State, City, COUNT(*) AS HowMany FROM WorldWideFriends GROUP BY Country, State, City; When we add columns to GROUP BY, the number of rows in the result increases. This is because the number of possible value combinations grows. When I add the City column to the SQL GROUP BY, the size of the result grows considerably: CountryStateCityHowMany ArgentinaMendozaCdad. Mendoza1 ArgentinaMendozaGodoy Cruz1 ArgentinaRío NegroBariloche1 VenezuelaDistrito CapitalCaracas1 ColombiaAntioquíaMedellín1 ColombiaCundinamarcaBogotá2 BrasilRio de JaneiroRio de Janeiro1 BrasilSao PauloSao Paulo1 USAKentuckyLexington1 USAFloridaMiami1 MéxicoGuerreroAcapulco1 UruguayCanelonesAtlántida1 In this case, I think it would be better to see only those cities where there are more than one of my friends. So to summarize the results, I will use the HAVING clause. This clause allows me to set a condition on the results of the aggregate functions when using GROUP BY. Here, the condition to apply will be that the count of friends is greater than 1 (COUNT(*) > 1). After incorporating the HAVING clause, the query looks like this: SELECT Country, State, City, COUNT(*) AS HowMany FROM WorldWideFriends GROUP BY Country, State, City HAVING COUNT(*) > 1; And this way, the result of the query is reduced to a single row that shows me the only city where there is more than one of my friends: CountryStateCityHowMany ColombiaCundinamarcaBogotá2 Other Ways of Using GROUP BY with Multiple Columns It is common to use GROUP BY multiple columns when two or more of the columns in a query result form a hierarchy of classifications with several levels. Such hierarchies are found in many areas, such as: Detailed sales data with the sale date divided into year, quarter, and month. A manufacturer’s product catalog organized by family, brand, line, model. The payroll of a company’s employees organized by management, sector, department. In all these cases, different subsets of columns can be used in the GROUP BY to go from the general to the particular. Using GROUP BY Multiple Columns: Grouping a Hierarchy Let’s look at an example result set of sales data. Suppose you have a view called ViewSales that returns the following information: YearQuarterMonthDateQuantityUnit_Price 202141111/15/2021516.08 2021388/2/2021117.06 2022244/5/2022219.48 2022255/21/2022117.06 202141111/17/2021218.50 2022244/5/2022118.08 2022388/16/2022515.26 It is easy to see that the first fields of this table form a hierarchy, with the year as the highest level and the date as the lowest level. Using GROUP BY and the SUM() function, we can obtain total sales amounts by Year, by Quarter, by Month or by Date. If you want to get the total units sold and the average unit price per Year and Quarter, you need to specify those two columns in the SELECT and in the GROUP BY: SELECT Year, Quarter, SUM(Quantity) AS TotalQty, AVG(Unit_Price) as AvgUnit_Prc FROM ViewSales GROUP BY Year, Quarter; The result will be: YearQuarterTotalQtyAvgUnit_Prc 20214717.29 20213117.06 20222418.21 20223515.26 Please note that, although there is a hierarchical order, the data in the different grouping columns are independent of each other. This means that if you group only by Quarter instead of by Year plus Quarter, the aggregate calculations will combine the information from the same quarter for all years (i.e. all Q2s will have one row): SELECT Quarter, SUM(Quantity) AS TotalQty, AVG(Unit_Price) as AvgUnit_Prc FROM ViewSales GROUP BY Quarter; QuarterTotalQtyAvgUnit_Prc 4717.29 3616.16 2418.21 This is not a mistake; you just need to understand that the results express different insights. The latter query allows you to compare sales performance between different quarters regardless of the year (e.g. to detect seasonal factors affecting sales at the same time of every year), while the former compares sales for each particular year and quarter. Using GROUP BY Multiple Columns: Non-Hierarchical Grouping In the previous example, we saw that grouping by multiple columns allows us to go from the general to the particular when we have data sets with columns that form a data hierarchy. But in situations where a result set is made up of columns that do not form a hierarchy, using GROUP BY with multiple columns allows us to discover hidden truths in large data sets; it combines attributes that at first glance are unrelated to each other. For example, let’s imagine we have a table named Downloads that stores information on people who have downloaded movies from a streaming service in the last two years. That table has one row for each download, and each row includes the following information about each person who downloaded a movie: Age Gender Nationality Each row also captures these attributes about each downloaded movie: Genre Year Country Using GROUP BY with several of these columns and the COUNT(*) function, we can detect correlations between the columns. For example, to find out movie genre preferences by age, we’d type: SELECT Age, Genre, COUNT(*) AS Downloads FROM Downloads GROUP BY Age, Genre As results, we’d get something like this: AgeGenreDownloads 18Horror12,945 18Comedy15,371 19Drama25,902 19Horror11,038 21Comedy37,408 ……… We could also use GROUP BY 3 columns, to find out (for example) genre preferences by gender and nationality: SELECT Gender, Nationality, Genre, COUNT(*) AS Downloads FROM Downloads GROUP BY Gender, Nationality, Genre And we would get something like this: GenderNationalityGenreDownloads MaleFrenchHorror102,044 MaleFrenchComedy149,290 MaleGermanHorror80,104 FemaleFrenchHorror91.668 FemaleGermanComedy50,103 FemaleGermanDrama61,440 OtherFrenchDrama77,993 OtherGermanComedy25,484 ………… GROUP BY with Multiple Columns Returns Faceted Information GROUP BY is a powerful tool for extracting insights from large data sets that are difficult to manipulate in any other way. By using GROUP BY multiple columns, you can leverage its full potential to expose the truths of a dataset, allowing you to see different facets of it. To do this successfully, it is critical that you understand – and know how to explain – what an SQL result set grouped by multiple columns represents. If you’re planning to do some serious data analysis work, then you should take our interactive SQL Basics course to learn about all the tools SQL can offer. Also, follow these links if you need further explanations of GROUP BY or want to see more examples of GROUP BY in SQL. Tags: sql learn sql GROUP BY