2nd May 2023 12 minutes read Top 7 Advanced SQL Queries for Data Analysis Nicole Darnley sql data analysis Table of Contents 7 Advanced SQL Queries Data Analysts Should Know 1. Grouping Data by Time Period 2. Create Multiple Grouping Levels Using ROLLUP 3. Ranking Data Using Window Functions 4. Computing the Difference (Delta) Between Rows 5. Computing a Running Total 6. Computing a Moving Average 7. Counting Elements in Custom Categories Using SUM() and CASE WHEN Power Up Your Data Analysis with Advanced SQL Queries Explore essential advanced SQL queries for data analysis. Structured Query Language, or SQL, is an indispensable tool for data analysts. Most people pick up the language relatively quickly and can begin in-depth data analysis after just a few lessons. Because of this, many data analysts tend to stay in the beginner/intermediate level of using SQL. This level of understanding allows you to “get the job done”, but it might not be the most efficient way to write a query. Mastering advanced SQL techniques can greatly enhance your data analysis capabilities, allowing you to delve deeper into your datasets and derive valuable insights. In this article, we will explore seven advanced SQL queries that can be useful for data analysis. For each example, we will present the query and its results, discuss the syntax used, and provide a brief explanation of how the database computes the result. If you’d like to improve your SQL skills, be sure to check out our SQL Reporting track. It is a great way to learn and practice more advanced SQL functions. 7 Advanced SQL Queries Data Analysts Should Know 1. Grouping Data by Time Period In this first example, let’s examine data in the table sales. This dataset includes an ID for each product, the date of the transaction, and the total amount of the transaction. product_idsale_dateamount 12021-01-01100 22021-01-15200 12021-02-01300 22021-02-15400 12022-01-10200 12022-02-05100 22022-01-27200 22022-02-12400 One very common data analysis scenario is to take raw data like this and aggregate it by specific date periods, such as month or year. Let’s go ahead and do just that by running the following query: SELECT EXTRACT(YEAR FROM sale_date) AS year, EXTRACT(MONTH FROM sale_date) AS month, SUM(amount) AS total_sales FROM sales GROUP BY year, month ORDER BY year, month; This query will return these results: yearmonthamount 202101300 202102700 202201400 202202500 In this query, we use the EXTRACT() function to pull the year and month from the sale_date field. We then group the data by the new year and month fields and calculate the total sales for each group using the SUM() function. Note that we need to group by both year and month to get accurate results. If we were to only pull in the month, the results would combine the values for specific months across the years (so all Januaries from all years would be combined into one row, all Februaries would be combined together, etc). When we group the results by both year and month, the months from different years are put into separate rows. We then order the rows by year and month to have ordered results. Different database engines often have different and sometimes better functions to accomplish the same result, such as DATE_TRUNC() in PostgreSQL or TRUNC() in Oracle. 2. Create Multiple Grouping Levels Using ROLLUP Sometimes you may want to group data by multiple levels in a single query. Let’s say you want to add the total sales (across all years) as well as the total sales in each year to the previous example. You can add additional rows for the grand total and totals for each year by using the ROLLUP keyword. Using the same dataset, we’ll run this query: SELECT EXTRACT(YEAR FROM sale_date) AS year, EXTRACT(MONTH FROM sale_date) AS month, SUM(amount) AS total_sales FROM sales GROUP BY ROLLUP(year,month) ORDER BY year, month; Result: yearmonthtotal_sales 202101300 202102700 2021NULL1000 202201400 202202500 2022NULL900 NULLNULL1900 In our example, ROLLUP adds three extra rows: one for the grand total of all sales and two for the total sales of each year. In our result set, the third row is the yearly total for 2021: the value in the year column is 2021 and the value in the month column is NULL. The sixth row is the total for 2022: the value in the year column is 2022 and the value in the month column is NULL. The final row is the total for both years: it has NULL in both the year and month columns. These rows were added by ROLLUP. The NULL values in the year and month columns indicate the aggregated rows. 3. Ranking Data Using Window Functions Ranking data is a common requirement in advanced data analysis. A few business cases would be ranking your products by the highest sales to understand which products bring in the most revenue or ranking stores by the lowest sales to understand which stores are your lowest performers. You can use window functions like RANK() or DENSE_RANK() to assign ranks to rows based on a specific column's value. First, let’s take a look at the RANK() function: SELECT product_id, SUM(amount) AS total_sales, RANK() OVER(ORDER BY SUM(amount) DESC) AS rank FROM sales GROUP BY id; Result: product_idtotal_salesrank 212001 17002 This query first calculates the total sales for each product_id and then ranks them in descending order. The RANK() function is a window function used to compute the ranking. The OVER() clause is the syntax used with window functions. The RANK() function assigns a unique rank to each row within a result set, based on specific ordering. The ORDER BY clause in the OVER() clause specifies the order in which the rank will be assigned, based on one or more columns. In our example we rank the rows by the total sales amount. The same syntax can be used for the DENSE_RANK() function. The difference between the two is in the scenario where two values are equal. If two values tie for first place, RANK() will skip to 3 for the third row, while DENSE_RANK() would assign the third row as 2. For more information about ranking functions, read our in-depth article Overview of Ranking Functions in SQL. You can also filter the top X rows by using a common table expression (CTE) and a WHERE clause. For example, we can run the following to only pull our lowest-ranked row: WITH sales_cte AS ( SELECT product_id, SUM(amount) AS total_sales, RANK() OVER (ORDER BY SUM(amount) DESC) AS rank FROM sales GROUP BY id ) SELECT * FROM sales_cte WHERE rank = 1; In this query, we’re defining a CTE called sales_cte, which gives the rank of each product_id based on its total sales. We then query this CTE, only pulling the rows where rank = 1. This allows us to only return the top row. 4. Computing the Difference (Delta) Between Rows Sometimes you need to compare the values of consecutive rows. One example would be pulling a previous month’s sales to compare to the current month. Having both the current month’s sales and the previous month’s sales allows you to compute the difference in the two values. You can use window functions like LEAD() and LAG() to achieve this. We’ll use a different table this time: sales_agg. This table has the aggregate sales for each year and month. yearmonthtotal_sales 20211300 20212700 20221400 20222500 Let's compute the month-over-month sales difference using LAG(): SELECT year, month, total_sales, LAG(total_sales) OVER(ORDER BY year, month) AS previous_month, total_sales – LAG(total_sales) OVER(ORDER BY year, month) AS sales_difference FROM sales_agg ORDER BY year, month; This query uses the LAG() function to calculate the difference between the current month's sales and the previous month's sales: yearmonthtotal_salesprevious_monthsales_difference 202101300NULLNULL 202102700300400 202201400700-300 202202500400100 The LAG() function is a window function. It pulls the total_sales value from the previous row, ordered by year and month. We’re aliasing that column as previous_month. Then the sales_difference column is calculated by taking the total_sales for each row and then subtracting the previous_month value. The first row has NULL values, since there is no previous row. If you needed to pull the next month’s sales, you would use the window function LEAD() in place of LAG(). LAG() compares the current row to previous values; LEAD() compares the current row to subsequent values. How to Calculate the Difference Between Two Rows in SQL gives multiple real-life examples of using this syntax to compute the difference between two rows. 5. Computing a Running Total A running total calculates the sum of a sequence of numbers. It’s also known as a cumulative total or cumulative sum; it adds each new value to the previous total. Running totals are useful in calculating the cumulative effect of data points over time. For example, you might want to calculate the cumulative number of users that have visited your website up until a certain time to understand your website’s user growth. Let’s revisit the SUM() function and see how we could use it to calculate a running sum of sales. As with the previous example, we’ll need to use SUM() as a window function to achieve the desired results. SELECT year, month, total_sales, SUM(total_sales) OVER(ORDER BY year, month) AS running_total FROM sales_agg ORDER BY year, month; This running total is calculated using the SUM() function with the OVER() clause. This adds the total_sales from the current row to all previous rows in the specified order. In the second row, the running total is calculated by aggregating the total_sales value from both the first and second row. In our example, the second row is for February of 2021. The running_total value is the sum of the first row (for January 2021) and the second row (for February 2021). In the third row, the running total is calculated by aggregating the values from the first through third rows. This same pattern continues through each row. yearmonthtotal_salesrunning_total 202101300NULL 2021027001000 2022014001400 2022025001900 For more information on computing running totals in SQL, check out our article What Is a SQL Running Total and How Do You Compute It?. 6. Computing a Moving Average When looking at sales trends in data analysis, it is often helpful to use a moving average rather than each individual data point. A moving average (also known as a rolling average) calculates the average of the current value and a specified number of immediately preceding values. This technique helps smooth out data and identify trends, especially when your data has high volatility. The main idea is to examine how these averages behave over time instead of examining the behavior of the original data points. For example, you might need to analyze daily sales of a restaurant that has high sales on the weekend but low sales Monday through Wednesday. If you were to plot each individual data point, you would see very high values and very low values next to each other, making long-term trends more difficult to see. By using a 3-day moving average, you would take the average of the past 3 days, which evens out the highs and lows. For this example, we’ll modify our sales_agg table. yearmonthtotal_sales 202101300 202102700 202103500 2021041000 202105800 202106600 Now let’s calculate a 3-month moving average of sales: SELECT year, month, total_sales, AVG(total_sales) OVER(ORDER BY year, month ROWS BETWEEN 2 PRECEDING and CURRENT ROW) AS moving_average FROM aales_agg ORDER BY year, month; This query calculates the moving average using the AVG() function with the OVER() clause. The ROWS BETWEEN clause specifies the range of rows to include in the moving average calculation – in this case, the current row and the two preceding rows. The result: yearmonthtotal_salesmoving_average 202101300300 202102700500 202103500500 2021041000733 202105800767 202106600800 Since there are no previous rows, the first row’s moving average is just the total sales value. The second row’s moving average is the average of 300 and 700. For the third row, we now have the 2 preceding rows, as defined in our query; the moving average is calculated by averaging 300, 700, and 500. This pattern continues for the rest of the rows. What a Moving Average Is and How to Compute It in SQL is a great resource for more information on this topic. 7. Counting Elements in Custom Categories Using SUM() and CASE WHEN You can count elements in custom categories by combining SUM() with CASE WHEN. You’d use this when you need to create business logic that doesn’t exist in your data. For example, you might want to group specific locations by custom regions and then compute metrics based on those regions. CASE WHEN statements allow you to perform conditional logic in queries. The syntax is structured like this: CASE WHEN condition_1 THEN result_1 WHEN condition_2 THEN result_2 END The conditions are evaluated from top to bottom. The first condition that evaluates as true determines which result is returned. In our example, let's create a sales category breakdown based on sales amounts: SELECT SUM(CASE WHEN total_sales < 400 THEN 1 ELSE 0 END) AS low_sales, SUM(CASE WHEN total_sales >= 400 AND total_sales < 700 THEN 1 ELSE 0 END) AS medium_sales, SUM(CASE WHEN total_sales >= 700 THEN 1 ELSE 0 END) AS high_sales FROM sales_agg In the above query, we use the CASE WHEN expression to categorize each month into low_sales, medium_sales, or high_sales categories. When the condition in the CASE WHEN expression is met, it returns a value of 1; otherwise, it returns 0. The SUM() function is then used to add up the 1s and 0s for each category, effectively counting the number of months that fall into each category. The result is a single row with the counts of low_sales, medium_sales, and high_sales months. low_salesmedium_saleshigh_sales 123 This approach allows you to create custom categories and count the elements in each category with a single query. The combination of SUM() and CASE WHEN is versatile and can be adapted to different use cases. If you want to use this construction in your query, How to Use CASE WHEN with SUM() in SQL will give you more details. Power Up Your Data Analysis with Advanced SQL Queries Mastering advanced SQL queries is essential for efficient and accurate data analysis. In this article, we've covered seven powerful SQL techniques that can help you gain deeper insights into your data. By learning how to group data by time periods, use ROLLUP for multiple grouping levels, rank data with window functions, compute differences between rows, calculate running totals and moving averages, and count elements in custom categories, you'll be well-equipped to tackle complex data analysis tasks. As you continue to hone your SQL skills, you'll find even more ways to leverage this powerful tool to unlock the full potential of your datasets and drive better decision-making in your organization. For more practice and to gain an even deeper level of understanding on these topics, sign up for our SQL Reporting track. It is a great way to master these advanced SQL functions. Read our guide on how to practice advanced SQL online to discover more advanced SQL practice opportunities on our platform. Tags: sql data analysis