9th Jul 2020 12 minutes read What Is a CTE? Marija Ilic sql learn sql CTE Table of Contents CTE: Definition and Basic Syntax Subqueries vs. CTEs Multiple CTEs in a Statement Table Creation vs. CTEs Non-Recursive and Recursive CTEs Code Readability and CTEs After mastering statements like SELECT, DELETE, INSERT, and GROUP BY, you might search for ways to improve code maintainability, reproducibility, and readability. At that point, you will probably start learning about modern SQL concepts that were introduced in the early 2000s. One such SQL technique is the CTE? (common table expression) —?a temporary named result set. In this article, you will learn what a CTE is and how to use it to improve the maintenance and the readability of your code. CTE: Definition and Basic Syntax A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements. Let's start with an example. Consider the following two tables: job_offers: a table that contains data science roles, with salaries by level and location in the UK. rolelocationlevelsalary data scientistLondonentry45000 data scientistoutside Londonentry34000 data scientistLondonmid65000 data scientistoutside Londonmid60000 data scientistLondontech lead95000 data scientistoutside Londontech lead73000 data scientistLondondirector140000 data scientistoutside Londondirector101000 quantitative analystLondonentry55000 quantitative analystoutside Londonentry43000 quantitative analystLondonmid83000 quantitative analystoutside Londonmid66000 quantitative analystLondontech lead100000 quantitative analystoutside Londontech lead72000 quantitative analystLondondirector155000 quantitative analystoutside Londondirector120000 machine learning engineerLondonentry44000 machine learning engineeroutside Londonentry36000 machine learning engineerLondonmid67000 machine learning engineeroutside Londonmid58000 machine learning engineerLondontech lead95000 machine learning engineeroutside Londontech lead84000 employee_occupation: a table that contains 5 employees from Company X, with their respective occupations and locations of work. namerolelocation Tim Smithdata scientistLondon Joana Loquantitative analystoutside London Ed Merithmachine learning engineerLondon Maria Soldinimachine learning engineeroutside London Tina Moritoquantitative analystoutside London Now, imagine that you would like to list all employees with the average salary that corresponds to their specific role (column role). What you would need to do is: Calculate the average salary for each role. For that, you can use the job_offers table. Merge the aggregated values with the employee_occupation table (i.e., join them together). Here is the CTE for calculating and displaying the desired output: WITH average_salary AS ( SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role ) SELECT a.*, b.avg_salary FROM employee_occupation a LEFT JOIN average_salary b ON a.role = b.role; Here’s what is happening in the code: The CTE definition starts with a special keyword WITH. After the WITH, the CTE is given a name. In our example, the CTE name is average_salary. The CTE name is followed by a special keyword AS. The SELECT statement is inside the parentheses, whose result set is stored as a CTE. In our example, the temporary result set average_salary is created with a SELECT. Now that you have a CTE, you can use it in a subsequent SQL statement by referencing it as you would any other table. Here, we use our temporary result set in a JOIN statement. The main query follows, in which the CTE average_salary is used. The temporary result set average_salary lives only within the scope of the statement immediately following the WITH clause. In our example, this means that average_salary automatically disappears after the main query is executed and cannot be used anywhere else in the code. It is as if you create a table, use it in one SQL statement, then delete or drop it from your database. As a result, SQL returns for each employee the average salary for their role: namerolelocationavg_salary Tim Smithdata scientistLondon76625 Ed Merithmachine learning engineerLondon64000 Maria Soldinimachine learning engineeroutside London64000 Joana Loquantitative analystoutside London86750 Tina Moritoquantitative analystoutside London86750 The basic syntax for CTE usage looks like this: As you can see from the image, we define a temporary result set (in our example, average_salary) after which we use it in that same statement (in our example, average_salary is used in the JOIN statement). There is a great course, Recursive queries, offered by LearnSQL.com; I suggest starting with this if you want to learn more about the CTE syntax. Subqueries vs. CTEs You may have noticed that our last example can be written with a subquery like this: SELECT a.*, b.avg_salary FROM employee_occupation a LEFT JOIN (SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role) b ON a.role = b.role; Although written differently, it will return the exact same output as our CTE code from earlier. Here is what this code with a subquery does: Inside the JOIN statement, we have a subquery in parentheses: “select role, avg(salary) as avg_salary from job_offers group by role” is a subquery. The SQL engine runs this subquery first, then performs the join with the result returned from the subquery. Although a subquery returns the same result as a statement that uses a CTE, I always recommend CTEs over subqueries in a case like this one. Why? Well, they make the code more readable and understandable. Long and complicated queries can have many, many subqueries which quickly become hard to read and follow. Multiple CTEs in a Statement So far, we have used just one temporary result set in a WITH clause. What if we have a need to create more CTEs in a single statement? This can happen if you need to rewrite an SQL statement that has many subqueries. Well, you can do that — it is possible to have more CTEs in a single SQL statement. Let's go through an example with two subqueries. Imagine that you notice the salary differences by location in addition to the difference by role. You want to display additional information in your output, namely the average salary by role and location, and display it with employee-level information. In this case, in addition to the subquery like the following that calculates the average salary by role: SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role you need to write another subquery like this: SELECT role, location, avg(salary) AS avg_salary FROM job_offers GROUP BY role, location Using these subqueries, your final SELECT will look like this: SELECT a.*, b.avg_salary, c.avg_salary_additional FROM employee_occupation a LEFT join (SELECT role, avg(salary) as avg_salary FROM job_offers GROUP BY role) b ON a.role = b.role LEFT JOIN (SELECT role, location, avg(salary) AS avg_salary_additional FROM job_offers GROUP BY role, location) c ON a.role = c.role AND a.location = c.location; The final query now looks much more complicated; it is harder to follow. Here is what the same query looks like using CTEs: WITH average_salary AS ( SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role ), average_salary_additional AS ( SELECT role, location, avg(salary) AS avg_salary_additional FROM job_offers GROUP BY role, location ) SELECT a.*, b.avg_salary, c.avg_salary_additional FROM employee_occupation a LEFT JOIN average_salary b ON a.role = b.role LEFT JOIN average_salary_additional c ON a.role = c.role and a.location = c.location; Looks better, right? Here, we have defined two CTEs. The following is what is going on in this code: Two CTEs are defined and used inside a single WITH clause: average_salary and average_salary_additional. They are separated by a comma inside the same WITH statement. Each subquery is named separately, making your final SELECT more readable. The first SQL CTE (average_salary) stores the average salary by role. The second CTE (average_salaries_additional) stores the average salary for each combination of role and location in the table job_offers. Once defined, both CTEs are used in the main query in a single statement. The result displays all averages—the average by role and the average for each combination of role and location—with the details of each employee: namerolelocationavg_salaryavg_salary_additional Tim Smithdata scientistLondon7662586250 Ed Merithmachine learning engineerLondon6400068666.6667 Maria Soldinimachine learning engineeroutside London6400059333.3333 Joana Loquantitative analystoutside London8675075250 Tina Moritoquantitative analystoutside London8675075250 Once this single SQL statement is executed and the result is displayed on the screen, both CTEs are dropped and become unavailable for further usage in the code. The same result can be obtained with window functions and just one CTE, but we used multiple common table expressions here. This is a good illustration of how two CTEs can be used in one query. You may have noticed that each CTE from our example reads a SQL table named job_offers. Of course, each CTE can read different tables from the database or even read another CTE defined in the same statement. That is, nesting is allowed when working with CTEs — one CTE can read or reference another CTE. The only requisite is that they must be defined in the same WITH query. Here is an example with nested CTEs: WITH avg_salary_per_role AS ( SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role ), min_avg_salary AS ( SELECT min(avg_salary) AS min_avg_salary FROM avg_salary_per_role ) SELECT * FROM avg_salary_per_role a INNER JOIN min_avg_salary b ON a.avg_salary = b.min_avg_salary; In this code: We defined two CTEs: avg_salary_per_role and min_avg_salary. The first CTE (avg_salary_per_role) calculates the average salary by role. The second CTE (min_avg_salary) reads the first CTE (avg_salary_per_role) and calculates the minimum of the role-level average salaries. The average salaries were calculated by the first CTE. The main query uses both CTEs in a JOIN statement and returns the role with the lowest average salary among all roles: roleavg_salarymin_avg_salary machine learning engineer6400064000 Once this SQL statement is executed and the result is displayed on the screen, both CTEs are dropped and are no longer available for further usage in the code. Table Creation vs. CTEs We have seen how using CTEs is a great alternative to subqueries. It is also a great alternative to creating a real table. Developers often create temporary tables in the database, use them in the next query, and then drop them afterwards. This approach can be replaced with one that uses CTEs. Why do I mention this? Before I started working with CTEs, I used to do exactly what I described above. Consider a query from our last example that looks like this: WITH avg_salary_per_role AS ( SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role ), min_avg_salary AS ( SELECT min(avg_salary) AS min_avg_salary FROM avg_salary_per_role ) SELECT * FROM avg_salary_per_role a INNER JOIN min_avg_salary b ON a.avg_salary = b.min_avg_salary; I would have written it like this: CREATE TABLE avg_salary_per_role AS SELECT role, avg(salary) AS avg_salary FROM job_offers GROUP BY role; CREATE TABLE min_avg_salary AS SELECT min(avg_salary) AS min_avg_salary FROM avg_salary_per_role; SELECT * FROM avg_salary_per_role a INNER JOIN min_avg_salary b ON a.avg_salary = b.min_avg_salary; DROP TABLE avg_salary_per_role; DROP TABLE min_avg_salary; Although both solutions will give you the same result, it is better to use CTEs so that you don’t need to worry about metadata?. You don’t need to remember table drops, because the SQL engine handles that for you with CTEs. Neat, right? Non-Recursive and Recursive CTEs So far, we have discussed the so-called “non-recursive” CTEs. However, we generally divide CTEs into two main types: Non-recursive CTEs Recursive CTEs A recursive CTE is a special form of nested CTEs. A recursive CTE is one that references itself within the same common table expression. Here is what a recursive CTE looks like: WITH recursive_cte (column_list) AS ( -----(in this query we are referencing recursive_cte)---- ) SELECT * FROM recursive_cte Recursive CTEs work on hierarchical data, so the data must be properly defined. The code is run until all levels of hierarchy have been exhausted. I remember when I wrote my first recursive CTE. I needed to extract basic information about the organizational structure in a bank. A recursive CTE is appropriate here, because the organizational structure in a bank is hierarchical (i.e., has a tree structure): There is a main unit to which all units belong (the so-called “anchor”). In a bank, this is the supervisory board. There are organizational units for which a clear hierarchy is defined. For example, the data analytics group is under the IT department along with other units, the IT department is a part of the business support unit, and the business support unit is under the supervisory board which is the main unit (anchor). Since I needed the number of employees for each unit (anchor, parent, child), the recursive CTE was the way to go. I remember how happy I was when my script worked—it had only one SQL statement! I could have written a separate query for each department for which I needed the information. Since I used a recursive CTE, however, I got the results without having to write so many queries. We won’t dive deep into recursive CTEs here; for more details, check out our interactive online course on LearnSQL.com that covers this topic. If you want to see CTEs in a real business example, I suggest Creating Basic SQL Reports—there, the multi-level aggregation, often required in business reports, is demonstrated with a use of CTEs. In addition, there are several great articles on various CTE topics with examples; you can find them here and here. Code Readability and CTEs To summarize, here are some reasons why CTEs are important, and why and when you should use CTEs: To avoid subqueries in SQL statements. This specifically refers to situations in which we are joining several tables in a main query and one or more of those tables is a subquery. In this case, a CTE is a great alternative that will improve readability of your code. To avoid unnecessarily creating tables and views in your database. CTEs help avoid unnecessary metadata. It is common to create tables to be used in joins so that you don’t have to write subqueries. However, rather than generating a “real” table whose metadata is stored in a database, you can use a CTE as an alternative. Since it is a temporary named result set, it is not stored in a database or used later in your code, and your code will be just as readable as the code that creates real tables. To make your code easier to understand and maintain. CTEs are particularly useful in long queries. They are a very effective way for maintaining more complicated queries. Each CTE has a name, and intuitive names can help tremendously with the readability and the maintenance of the code. It is much easier to figure out what is going on in the code with intuitive names for code chunks, since you can understand quickly what the code chunk is about by its name. We have walked through the basic syntax and several examples of how and when you can define and use CTEs. To understand better and to practice, I recommend a course from LearnSQL.com named Recursive Queries. In this course, we interactively teach you how to use CTEs (recursive and non-recursive) in your day-to-day work. You can also find more CTEs in practice in an interactive course named Creating Basic SQL reports, in which you can learn how to use CTEs in multi-level aggregations (like creating business reports). Once you master the CTE, you will be pleased with how much better your SQL scripts look! Tags: sql learn sql CTE