10th Apr 2020 13 minutes read Analyze Time Series COVID-19 Data with Window Functions Marija Ilic sql learn sql window functions Table of Contents Publicly Available COVID-19 Data Finding the Daily Number of Confirmed Cases SQL Constructions for Analyzing COVID-19 Time Series Data Total number of confirmed cases on a country and province level Creating a country-level summary of confirmed cases with ROLLUP Calculating a running total with OVER and PARTITION BY Calculating the daily percent change in confirmed cases Using RANK to find the highest number of confirmed cases Summary The COVID-19 pandemic really changed the way we live. No matter where you are, you're being impacted by this unpredictable virus. We do not know what the future holds, but we do believe in science and in better days to come. One thing that makes me as a data analyst particularly happy is that we have publicly available data. We can analyze that data and learn something from it. If you want to explore COVID-19 with the help of modern SQL constructions, keep reading — this article is intended for you. Publicly Available COVID-19 Data Today's news is filled with COVID-19 information. No matter which site you visit, the coronavirus topic is inevitable. This is not surprising; our situation is totally extraordinary and we all hope to return to normal someday. Until that "normal" appears, the coronavirus will stay a central topic of interest. For data analysts, the good news is that coronavirus-related time series data is publicly available and updated daily. We can download the data sets and draw our own conclusions from analyzing them. There are a lot of different tools and technologies available to make such an analysis. In this article, I will show how you can analyze the data with the help of advanced SQL techniques. We are going to use data published by Johns Hopkins University on their GitHub page. Before we start a short analysis, let's look at the data in the database. Finding the Daily Number of Confirmed Cases We'll use the confirmed_covid table in this article. It contains the total number of confirmed coronavirus cases per day for each country and province/state. The first date in the table is 22 January 2020. The data comes from Johns Hopkins University's official GitHub page, which is refreshed daily. The data has been transformed with R code and loaded into this MySQL table. The R code is available on my GitHub page. Here is how the data looks for Croatia (one of the affected countries) during March 2020: countryprovince_statedateconfirmed_day Croatia2020-03-011 Croatia2020-03-020 Croatia2020-03-032 Croatia2020-03-041 Croatia2020-03-050 Croatia2020-03-061 Croatia2020-03-071 Croatia2020-03-080 Croatia2020-03-090 Croatia2020-03-102 Croatia2020-03-115 Croatia2020-03-120 Croatia2020-03-1313 Croatia2020-03-146 Croatia2020-03-1511 Croatia2020-03-168 Croatia2020-03-178 Croatia2020-03-1816 Croatia2020-03-1924 Croatia2020-03-2023 Croatia2020-03-2178 Croatia2020-03-2248 Croatia2020-03-2361 Croatia2020-03-2467 Croatia2020-03-2560 Croatia2020-03-2653 Croatia2020-03-2791 Croatia2020-03-2871 Croatia2020-03-2956 Croatia2020-03-3077 There are four columns — country, province_state, date, and confirmed_day. The confirmed_day column stores the number of patients that tested positive for COVID-19 on a specific day (the column date). For example, on 29 March, Croatia had 56 new confirmed cases. On 30 March, it had 77 new cases. Note that the province_state column is blank for Croatia, as Croatia is reporting on a country level. This is not the case for China, Canada, or similar countries; for these, the province_state value will be defined. SQL Constructions for Analyzing COVID-19 Time Series Data In this section, we are going to use several advanced modern SQL techniques to get some additional insights from the confirmed_covid table. So, we have a number of newly confirmed cases on a daily basis. Now we'll use SQL to get insights like the ones you see on popular news portals, such as: The total number of confirmed cases to date for each country and province. A higher-level summary of confirmed cases. The cumulative sum of confirmed cases. The percent change (compared to the previous day) of confirmed cases. Which countries had the highest number of confirmed cases in one particular day. All answers will be given with SQL; we'll use window functions and common table expressions – aka modern SQL techniques. We'll show that SQL can be used in complex time series analysis. With the right techniques, some facts can be extracted easily, using only a few lines of code. Total number of confirmed cases on a country and province level If you are an advanced beginner in SQL (as most of us are), you are probably familiar with traditional SQL statements like SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. If we want to calculate the total number of confirmed cases on a country and province level, we can use an ordinary GROUP BY clause: SELECT country, province_state, SUM(confirmed_day) AS total_confirmed FROM confirmed_covid GROUP BY country, province_state; So, when using GROUP BY clause data are divided into groups. Rows with the same GROUP BY column (country and province_state in our example) are put into one group. For each group, the total sum is calculated and returned as a result. Rows are collapsed, that is, merged; for each group, one row is displayed as the result. From our original table, one row will be returned for each country and province/state. Here is how the result is going to be displayed for China and Colombia: After GROUP BY is executed, Colombia is presented as one row (provinces are not defined for Colombia). On the other hand, for China we have several rows displayed after the GROUP BY is executed; this is because each province_state value's total sum is displayed. This simple report is produced with a simple GROUP BY statement. If you want to learn more about GROUP BY aggregations, I suggest that you try the interactive online course SQL Basics on LearnSQL.com. If we want to go one step further, we could also display a higher-level summary – the total confirmed cases on a country level and the total number of confirmed cases worldwide. Let's see how we can do that by using the ROLLUP clause. Creating a country-level summary of confirmed cases with ROLLUP We can add a higher-level summary to this output . If you want to see all three totals (by province, country, and total worldwide) you could write these three statements: SELECT country, province_state, SUM(confirmed_day) AS total_confirmed FROM confirmed_covid GROUP BY country,province_state; SELECT country, province_state, SUM(confirmed_day) AS total_confirmed FROM confirmed_covid GROUP BY country; SELECT country, province_state, SUM(confirmed_day) AS total_confirmed FROM confirmed_covid; The first result would give you the total sum of confirmed cases on a country and province/state level, a second one on the country level, and a third one worldwide. However, you can write simpler code by using the ROLLUP() clause: SELECT country, province_state, sum(confirmed_day) AS total_confirmed FROM confirmed_covid GROUP BY ROLLUP(country, province_state); Using group by ROLLUP(country, province_state), SQL will generate all three GROUP BY clauses and the output will look like this: As you can see in this output, there are some rows where province_state is marked NULL. There is also a row where both country and province_state are NULL. Those are higher-level summaries. All rows where province_state is marked NULL are rows that represent the total sum of confirmed cases per country. The row where both country and province_state are NULL is the total sum of confirmed cases worldwide. ROLLUP() is neat because you are getting everything that you need in one output, using one simple line of code. If you want to learn more about SQL's advanced GROUP BY extensions, check out LearnSQL's GROUP BY extensions in SQL course. Calculating a running total with OVER and PARTITION BY Currently in confirmed_covid, we have each day's total number of confirmed COVID19 cases for each country and province. In the last two examples, we saw how to get the total number of confirmed cases on the country and province/state levels. But what about running totals? In our next example, we'll write an SQL statement that will tell us how many total confirmed COVID-19 cases we had on a specific day on the country and province/state level. Here's the code: SELECT *, SUM(confirmed_day) OVER( PARTITION BY country, province_state ORDER BY date ) AS running_total FROM confirmed_covid; A running total (or cumulative sum) is calculated in SQL using aggregate window functions. Like the aggregate function used with the GROUP BY clause that we saw earlier, a window function also performs calculations across a set of rows. However, this result is not presented as a single output row per group, i.e., rows are not collapsed in the result table. Instead, each row from the input table is returned and the resulting value is assigned to each row in the table. A window function is denoted by the OVER clause in combination with the PARTITION BY and ORDER BY keywords. Here is the short description of what these keywords do: PARTITION BY tells us how rows are grouped into logical chunks/groups. In our example, rows are grouped on the country and province_state level. ORDER BY sets up the order of the rows. In our example, all rows are sorted by the date column. ROW or RANGE is not explicitly defined in this example, but these keywords signify that each window frame inside one partition comprises all rows from the start of the partition to the current row (for each row, the total sum is calculated as the sum of the confirmed_day values from the first row in the partition until the current row). With this SQL statement (OVER in combination with PARTITION BY and ORDER BY), we set up a running total of confirmed cases on the country and province/state levels. Let's use an example to explain this. Here is what we got after running the SQL statement for Croatia: On 25 February, Croatia had its first COVID-19 patient. The next day, there were two more patients, which brought the total to three confirmed cases on that day (26 February). On 5 March, Croatia had 10 confirmed cases in total; that's from the beginning of the pandemic (even though Croatia didn't detect a new case of COVID-19 on that day). A new case arrived on 6 March, and on that day Croatia had 10 + 1 = 11 detected cases. For each day, we now can detect how many confirmed cases we had in total and how that number changed through time. Keep in mind that we made this calculation for each country and province/state separately, which means that running totals are available for each country in the confirmed_covid table. For a detailed look at SQL window functions, I recommend LearnSQL's interactive Window Functions course. Calculating the daily percent change in confirmed cases Let's now calculate the percent change in new confirmed cases on a daily basis. For example, if today we have 10 new confirmed cases and yesterday we had five, our percent change is calculated like this: (10–5)/5 * 100 = 100% This means that we had a 100% increase in the number of daily confirmed cases. To calculate this number, we need to assign to each day in our table a value for the cases detected the previous day. We can do this using the LAG() window function: SELECT *, LAG(confirmed_day) OVER( PARTITION BY country, province_state ORDER BY date) AS confirmed_previous_day FROM confirmed_covid; Again, rows are grouped on the country and province level and ordered by the date column. After running this, SQL displays all rows from the confirmed_covid table plus the additional info contained in confirmed_previous_day (i.e., the confirmed_day value from the previous row). Here are several rows for Croatia: Each row now contains the number of confirmed new cases from the previous day. When we have this information, we can easily calculate the percent change using the formula: (confirmed_day - confirmed_previous_day)/ confirmed_previous_day * 100 Below is the complete SQL statement. Note that we are using common table expressions because we want to divide this relatively long query into several smaller pieces: WITH confirmed_lag AS ( SELECT *, LAG(confirmed_day) OVER( PARTITION BY country, province_state ORDER BY date ) AS confirmed_previous_day FROM confirmed_covid ), confirmed_percent_change AS ( SELECT *, COALESCE(ROUND((confirmed_day — confirmed_previous_day) / confirmed_previous_day * 100), 0) AS percent_change FROM confirmed_lag ) SELECT *, CASE WHEN percent_change > 0 THEN 'increase' WHEN percent_change = 0 THEN 'no change' ELSE 'decrease' END AS trend FROM confirmed_percent_change WHERE country = 'Croatia'; Here is a short explanation of what's going on: The WITH statement tells SQL that we are creating a temporary named result set. This is also called a common table expression or CTE. A temporary result set is the result of a query that is written inside brackets (). In our example, we are using nested CTEs, i.e., we are creating two temporary result sets named confirmed_lag (assigns the previous day's value to the current row) and confirmed_percent_change (uses the confirmed_lag result set and calculates the percent change for each day). At the end, a second temporary result set is used. Then, the final SELECT statement displays the output: SELECT *, CASE WHEN percent_change > 0 THEN 'increase' WHEN percent_change = 0 THEN 'no change' ELSE 'decrease' END AS trend FROM confirmed_percent_change WHERE country = 'Croatia'; The result after running this nested CTE looks like this: countryprovince_statedateconfirmed_dayconfirmed_previous_daypercent_changetrend Croatia2020-01-220 NULL 0no change ... ... ...... ...... Croatia2020-03-2248 78 -38decrease Croatia2020-03-2361 48 27increase Croatia2020-03-2467 61 10increase Croatia2020-03-2560 67 -10decrease Croatia2020-03-2653 60 -12decrease Croatia2020-03-2791 53 72increase Croatia2020-03-2871 91 -22decrease Croatia2020-03-2956 71 -21decrease Croatia2020-03-3077 56 38increase Common table expressions are really useful when you want to organize your code in a more readable way, as they allow long queries to be broken into smaller logical units. Using RANK to find the highest number of confirmed cases The last thing that we are going to do is display the countries with the highest number of confirmed cases on a specific day from 20 March to 30 March. The idea is to rank countries by the column confirmed_day. We want the country with the highest number of confirmed cases on a specific day to have rank 1, the second highest to have rank 2, and so on) Here is the query, followed by a short explanation: WITH highest_no_of_confirmed AS ( SELECT *, RANK() OVER( PARTITION BY date ORDER BY confirmed_day DESC ) AS highest_no_confirmed FROM confirmed_covid ) SELECT * FROM highest_no_of_confirmed WHERE highest_no_confirmed = 1 AND date BETWEEN '2020-03-20' AND '2020-03-30'; We are using a WITH statement. This creates a temporary result set named highest_no_of_confirmed. A SELECT statement creates a temporary result set named highest_no_of_confirmed using a RANK() window function. For each day, countries and provinces are ranked based on the confirmed_day column. This time, our rows are partitioned by the date column. Rows are ordered by the total number of new confirmed cases on a specific day. The country with the highest number of confirmed cases in a day is marked with 1, the second highest with 2, and so on. We are taking only the countries ranking 1 in the highest_no_of_confirmed result set. And we are looking only at ten days in March. Here is the result: countryprovince_statedateconfirmed_dayhighest_confirmed Italy2020-03-205986 1 Italy2020-03-216557 1 US 2020-03-227787 1 US 2020-03-23105711 US 2020-03-249893 1 US 2020-03-25120381 US 2020-03-26180581 US 2020-03-27178211 US 2020-03-28198211 US 2020-03-29194081 US 2020-03-30209211 We can see that Italy and the USA were hit hard in March. Summary In this blog post, I wanted to show how easy it can be to analyze publicly available COVID-19 time series data with the help of SQL window functions, common table expressions, and GROUP BY extensions. My idea was to show you that SQL is more than just SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. There are modern SQL techniques that help us perform complex analyses. You can continue to play with the data provided by Johns Hopkins. Other files are also available on my GitHub page. If you want to learn more about window functions and other modern SQL techniques, I suggest you visit LearnSQL.com and try the Advanced SQL mini track. Also, you may want to read our guide on how to practice advanced SQL with our platform. Tags: sql learn sql window functions