16th Apr 2020 8 minutes read Differences Between GROUP BY and PARTITION BY Emil Drkušić sql learn sql PARTITION BY GROUP BY Table of Contents PARTITION BY vs. GROUP BY GROUP BY PARTITION BY Window Functions PARTITION BY and GROUP BY: Similarities and Differences Window functions are a great addition to SQL, and they can make your life much easier if you know how to use them properly. Today, we will address the differences between a GROUP BY and a PARTITION BY. We’ll start with the very basics and slowly get you to a point where you can keep researching on your own. PARTITION BY vs. GROUP BY The PARTITION BY and the GROUP BY clauses are used frequently in SQL when you need to create a complex report. While returning the data itself is useful (and even needed) in many cases, more complex calculations are often required. This is where GROUP BY and PARTITION BY come in. Although they are very similar in that they both do grouping, there are key differences. We will analyze these differences in this article. GROUP BY The GROUP BY clause is used in SQL queries to define groups based on some given criteria. These criteria are what we usually find as categories in reports. Examples of criteria for grouping are: group all employees by their annual salary level group all trains by their first station group incomes and expenses by month group students according to the class in which they are enrolled Using the GROUP BY clause transforms data into a new result set in which the original records are placed in different groups using the criteria we provide. You can check out more details on the GROUP BY clause in this article. We can perform some additional actions or calculations on these groups, most of which are closely related to aggregate functions. As a quick review, aggregate functions are used to aggregate our data, and therefore in the process, we lose the original details in the query result. There are many aggregate functions, but the ones most commonly used are COUNT, SUM, AVG, MIN, and MAX. If you want to practice using the GROUP BY clause, we recommend our interactive course Creating Reports in SQL. Aggregate functions and the GROUP BY clause are essential to writing reports in SQL. Let’s consider the following example. Here we have the train table with the information about the trains, the journey table with the information about the journeys taken by the trains, and the route table with the information about the routes for the journeys. See below—take a look at the data and how the tables are related: Let’s run the following query which returns the information about trains and related journeys using the train and the journey tables. SELECT train.id, train.model, journey.* FROM train INNER JOIN journey ON journey.train_id = train.id ORDER BY train.id ASC; Here is the result: idmodelidtrain_idroute_iddate 1InterCity 1001111/3/2016 1InterCity 10025151/3/2016 1InterCity 1002121/4/2016 1InterCity 1003131/5/2016 1InterCity 1004141/6/2016 2InterCity 1006231/4/2016 2InterCity 1007241/5/2016 2InterCity 1008251/6/2016 2InterCity 1005221/3/2016 3InterCity 12510351/4/2016 3InterCity 12511351/5/2016 3InterCity 12529341/3/2016 3InterCity 12527331/5/2016 3InterCity 12512361/6/2016 3InterCity 1259331/3/2016 4Pendolino 39016471/6/2016 4Pendolino 39013441/4/2016 4Pendolino 39014451/4/2016 4Pendolino 39015461/5/2016 4Pendolino 39028461/6/2016 You can see that the train with id = 1 has 5 different rows, the train with id = 2 has 4 different rows, etc. Now, let’s run a query with the same two tables using a GROUP BY. SELECT train.id, train.model, COUNT(*) AS routes FROM train INNER JOIN journey ON journey.train_id = train.id GROUP BY train.id, train.model ORDER BY train.id ASC; And the result is the following: idmodelroutes 1InterCity 1005 2InterCity 1004 3InterCity 1256 4Pendolino 3905 From the query result, you can see that we have aggregated information, telling us the number of routes for each train. In the process, we lost the row-level details from the journey table. You can compare this result set to the prior one and check that the number of rows returned from the first query (number of routes) matches the sum of the numbers in the aggregated column (routes) of the second query result. Although you can use aggregate functions in a query without a GROUP BY clause, it is necessary in most cases. Aggregate functions work like this: You generate groups using a GROUP BY statement by specifying one or more columns that have the same value within each group. The aggregate function calculates the result. The original rows are “collapsed.” You can access the columns in the GROUP BY statement and the values produced by the aggregate functions, but the original row-level details are no longer there. “Collapsing” the rows is fine in most cases. Sometimes, however, you need to combine the original row-level details with the values returned by the aggregate functions. This can be done with subqueries by linking the rows in the original table with the resulting set from the query using aggregate functions. Or, you could try a different approach—we will see this next. PARTITION BY Depending on what you need to do, you can use a PARTITION BY in our queries to calculate aggregated values on the defined groups. The PARTITION BY is combined with OVER() and windows functions to calculate aggregated values. This is very similar to GROUP BY and aggregate functions, but with one important difference: when you use a PARTITION BY, the row-level details are preserved and not collapsed. That is, you still have the original row-level details as well as the aggregated values at your disposal. All aggregate functions can be used as window functions. Let’s look at the following query. In addition to train and journey, we now incorporate the route table as well. SELECT train.id, train.model, route.name, route.from_city, route.to_city, COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes, COUNT(*) OVER () AS routes_total FROM train INNER JOIN journey ON journey.train_id = train.id INNER JOIN route ON journey.route_id = route.id; Here is the result of the query: idmodelnamefrom_cityto_cityroutesroutes_total 1InterCity 100Manchester ExpressSheffieldManchester530 1InterCity 100BeatlesRouteLiverpoolYork530 1InterCity 100GoToLeadsManchesterLeeds530 1InterCity 100StudentRouteLondonOxford530 1InterCity 100MiddleEnglandWayLondonLeicester530 2InterCity 100StudentRouteLondonOxford430 2InterCity 100MiddleEnglandWayLondonLeicester430 2InterCity 100BeatlesRouteLiverpoolYork430 2InterCity 100GoToLeadsManchesterLeeds430 3InterCity 125BeatlesRouteLiverpoolYork630 3InterCity 125BeatlesRouteLiverpoolYork630 3InterCity 125MiddleEnglandWayLondonLeicester630 3InterCity 125StudentRouteLondonOxford630 3InterCity 125NewcastleDailyYorkNewcastle630 3InterCity 125StudentRouteLondonOxford630 4Pendolino 390ScotlandSpeedNewcastleEdinburgh530 4Pendolino 390MiddleEnglandWayLondonLeicester530 4Pendolino 390BeatlesRouteLiverpoolYork530 4Pendolino 390NewcastleDailyYorkNewcastle530 4Pendolino 390NewcastleDailyYorkNewcastle530 5Pendolino ETR310StudentRouteLondonOxford530 From the result set, we note several important points: We did not use a GROUP BY but still obtained aggregated values (routes and routes_total). We have the same columns (id and model) from the GROUP BY in the previous query, but the original row-level details were preserved. The aggregated values are repeated in all rows with the same values of id and model. This is expected; as an example, we have 5 journey records for id = 1, all of which have identical values for these columns. We also have values in the columns name, from_city, and to_city that are different within a given value of id. Had we used a GROUP BY on the columns id and model, these row-level details would be lost. COUNT(*) OVER () AS routes_total produced the same aggregate count, 30, as COUNT and GROUP BY would do. In this result set, however, this value is included in each row. The part COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes is very interesting. We have defined the group over which this window function should be used with the PARTITION BY clause. Therefore, in the routes column, we have a count of rows for only that group. Window functions are applied after the rows are filtered, thereby keeping row-level details while still defining the groups through PARTITION BY. Using standard aggregate functions as window functions with the OVER() keyword allows us to combine aggregated values and keep the values from the original rows. We can accomplish the same using aggregate functions, but that requires subqueries for each group or partition. It is important to note that all standard aggregate functions can be used as window functions like this. Window Functions Besides aggregate functions, there are some other important window functions, such as: ROW_NUMBER(). Returns the sequence number of the row in the result set. RANK(). Similar to ROW_NUMBER(), but can take a column as an argument. The rank order is determined over the value of this column. If two or more rows have the same value in this column, these rows all get the same rank. The next rank will continue from the equivalent number of rows up; for example, if two rows share a rank of 10, the next rank will be 12. DENSE_RANK(). Very similar to RANK(), except it doesn’t have “gaps.” In the previous example, if two rows share a rank of 10, the next rank will be 11. NTILE. Used to calculate quartiles, deciles, or any other percentiles. LAG & LEAD. Used to pull values from the previous (LAG) or the following (LEAD) row. There is no general rule about when you should use window functions, but you can develop a feel for them. I definitely recommend going through the Window Functions course; there, you will find all the details you will want to know! PARTITION BY and GROUP BY: Similarities and Differences Although we use a GROUP BY most of the time, there are numerous cases when a PARTITION BY would be a better choice. In some cases, you could use a GROUP BY using subqueries to simulate a PARTITION BY, but these can end up with very complex queries. Let’s wrap everything up with the most important similarities and differences: Similarity: Both are used to return aggregated values. Difference: Using a GROUP BY clause collapses original rows; for that reason, you cannot access the original values later in the query. On the other hand, using a PARTITION BY clause keeps original values while also allowing us to produce aggregated values. Difference: The PARTITION BY is combined with OVER() and windows functions to add a lot more functionalities. Tags: sql learn sql PARTITION BY GROUP BY