21st Sep 2017 10 minutes read An Illustrated Guide to the SQL Self Join Dorota Wdzięczna JOIN SQL basics SQL clauses Table of Contents When Should We Use Self Join in SQL? Hierarchical Data Sequential Data Graphs Using SQL Self Join to Find Duplicate Values Learn More What is an SQL self join and how does it work? When should it be used? We’ll provide answers to those questions! In SQL, we can combine data from multiple tables by using a JOIN operator. JOIN has several variants; we’ve already discussed CROSS JOIN, INNER JOIN, and OUTER JOIN. Most of the time, these operators join data from two or more different tables. You can practice all the different types of JOINs in our interactive SQL JOINs course. In this article, however, we will explain how to join records from the same table. A SQL self join joins data from the same table. In other words, it joins a table with itself. Records taken from the table are matched to other records from the same table. Why would you do this? You may need to compare a value with another value from the same row. You can’t do that unless you join the table to itself and compare the values as if they were in two separate records. There is no dedicated operator for this type of query. Instead, the SQL self join uses the INNER JOIN operator, one of the OUTER JOIN operators, or a CROSS JOIN. The difference is that a single table is listed as both the left and right table in the join. Let’s look at a SQL self join example based on the employee table, shown below: id first_name last_name salary manager_id 1 John Watson 7550 NULL 2 Anne Brown 3500 1 3 James Black 3000 1 4 Scarlett Miller 2500 3 5 Ethan Davis 1200 3 6 Jacob Smith 2000 3 The employee table stores employees’ ID numbers, first names, last names, salaries, and their manager’s ID number. With the exception of the boss (whose ID = 1), everyone has a supervisor. Only the boss can have a NULL value in the manager_id column. Let’s try a query that returns the immediate superior of each employee: SELECT e.id, e.first_name, e.last_name, e.salary, m.first_name AS fname_boss, m.last_name AS lname_boss FROM employee e JOIN employee m ON e.manager_id = m.id; Note that we used the employee table as both the left and right table in the join. To join data from the same table, we had to assign two aliases to the table name. You must use aliases when performing self joins. Because you’re joining columns from the same table, they will have the same names. The alias renames the columns so the database engine can execute your query. In the query above, we used the letter e as the alias for the left table. In this case, we are using the employee table for its employee records. We used the letter m as the alias for the right table, which used the employee table for its manager records. Even though we are using one table, SQL treats it as two different tables. We selected employee data by selecting the id, first_name and last_name columns in the e “table”. The managers’ data was selected by referring to the m “table”. Notice that the employee records were prefixed with an “e” and the manager records were prefixed with an “m”. We could now join them because we used an appropriate joining condition: the manager_id column from the e table was compared to the id column from the m. Records from both tables were matched using a JOIN operator (or INNER JOIN; remember, the “INNER” is implied). Did you notice that the boss is not shown in the results? They have no match in the right “manager” table, so no data about the boss’s superior could be found. The image below shows how records are matched during this query execution. Let’s analyze the data in the output table: id first_name last_name salary fname_boss lname_boss 2 Anne Brown 3500 John Watson 3 James Black 3000 John Watson 4 Scarlett Miller 2500 James Black 5 Ethan Davis 1200 James Black 6 Jacob Smith 2000 James Black In our table, the JOIN did not return a record for John Watson. He does not have an identifier pointing to his superior’s id, so he is the boss. We would have to use a LEFT JOIN instead of an INNER JOIN if we wanted to return data for all employees, including the boss. In that case, records from the left table with no match in the right table would be shown as well: SELECT e.id, e.first_name, e.last_name, e.salary, m.first_name AS fname_boss, m.last_name AS lname_boss FROM employee e LEFT JOIN employee m ON e.manager_id = m.id; Our interactive SQL JOINs course has the whole section dedicated to practicing self joins. When Should We Use Self Join in SQL? Self joins are commonly used in the following areas: Hierarchical relationships Sequential relationships Graph data We’ll look at each of these individually. Hierarchical Data Processing hierarchical data is one of the most frequent applications of the SQL self join. This occurs when there is an additional column pointing to an identifier in the same table, such as in our employee table. In our case, the manager_id column refers to (has the same value as) the id column. The example given above (the employee-manager relationship) is not the only situation where hierarchical data is used. A bill of materials for a car shares a similar structure. Each car consists of multiple components, such as the engine, the brake system, and the electrical system. Each of these components consists of smaller parts. A car’s electrical system can be broken into its components, such as the battery and the alternator (which can be broken down into even more car parts). This means that car parts constitute a group of hierarchical data. The car table presented below shows some car parts data. id name element_id 1 car NULL 2 electrical system 1 3 engine 1 4 battery 2 5 alternator 2 Yet another example of hierarchical data is the parent-child relationship. By storing this relationship, we can use a single table to house an entire family tree. We can then use a SQL self join to easily retrieve data about the ancestors of a given person. The table below can help us quickly identify the eldest ancestor(s) in a family. These people do not have ancestor data in their records, which means they form the root of the family tree. The person table presented below illustrates this: id first_name last_name birth mother_id father_id 1 John Watson 1945 NULL NULL 2 Anne Brown 1950 NULL NULL 6 Scarlett Miller 1985 2 1 7 Jacob Miller 1982 NULL NULL 8 David Miller 2015 6 7 The query below retrieves the first and last name of each person along with the first and last name of their mother and father. SELECT c.first_name, c.last_name, m.first_name AS fname_mother, m.last_name AS lname_mother, f.first_name AS fname_father, f.last_name AS lname_father FROM person c LEFT JOIN person m ON c.mather_id = m.id LEFT JOIN person f ON c.father_id = f.id; Note that we had to use a self join three times (i.e. on three “tables” or aliases) to get the data about both the father and the mother. John Watson, Anne Brown, and Jacob Miller do not have data pointing to their ancestors. The examples of hierarchical structures discussed above store data using a superior-inferior record approach. This enables us to present data as a tree structure. You’ll learn more about processing tree structures in SQL as you continue to learn about relational databases. Sequential Data Sequential data can also benefit from using a SQL self join. For example, suppose you have records that describe the consecutive steps required to prepare a dish. All of the steps can be placed in a single table. Their order is determined based on the columns that point to the IDs of the previous and next records in the same table. Let’s illustrate this using the instruction table: id content previous_id next_id 1 Preheat an oven to 220 degrees C. NULL 2 2 Peel four potatoes. 1 4 3 Toss sliced potatoes with oil. 4 6 4 Cut potatoes into slices. 2 3 5 Season the hot slices with salt and pepper. 6 NULL 6 Bake in the preheated oven for 20 minutes. 3 5 As you see, in the current order these instructions do not make sense. But when we use a SQL self join to show the sequential relationship for each step in the recipe, we get: Preheat an oven to 220 degrees C. Peel four potatoes. Cut potatoes into slices. Toss sliced potatoes with oil. Bake in the preheated oven for 20 minutes. Season the hot slices with salt and pepper. We also get some tasty oven fries! Graphs Self join SQL can also be used to show the relationships needed for graphs. A graph is a structure consisting of nodes connected to each other with edges (relations). One example of a graph is the road network between multiple cities. Take a look at the drawing below. This graph represents five cities that are connected with each other. Each arrow shows a road from one city to another. In this case, the cities are the nodes and the roads between them are the edges. We’re using two tables to store this data. The city table stores the ID number and the name of each city. The route table contains the route ID number, the starting city (the from_city_id column) and the target city (the to_city_id column). Here is the “city” table: id name 1 Laredo 2 San Antonio 3 Austin 4 Waco 5 Houston And this is the “route” table: id from_city_id to_city_id 1 4 1 2 4 3 3 4 2 4 1 4 5 2 3 6 2 5 7 5 3 We can use a SQL self join on the city table, along with an INNER JOIN of the two tables, to find out what routes exist between cities. Take a look at the query. SELECT c1.name AS from_city, c2.name AS to_city FROM city c1 JOIN route r ON c1.id = r.from_city_id JOIN city c2 ON c2.id = r.to_city_id; The city and route tables were joined using the id column from city and the from_city_id column from route. At this point, we could only retrieve the name of the start city. In order to retrieve the name of the target city, we used a self-join on the city table. This time, we compared the id from the aliased city table with the to_city_id column in the route table. Here is the result: from_city to_city Waco Laredo Waco Austin Waco San Antonio Laredo Waco San Antonio Austin San Antonio Houston Houston Austin You can also use a SQL self join in recursive subqueries that store graphs. More information about this topic can be found in our new course, “Recursive Queries”. Using SQL Self Join to Find Duplicate Values Self-joins can also be used to identify duplicate values in a table. Let’s introduce an example table called color: id name 1 blue 2 green 3 yellow 4 blue 5 yellow Each record in the table is different because of the id column, which must always be unique. But this doesn’t prevent two rows from storing the same color name. We want to identify such cases and find the IDs of the duplicate color names. Let’s try this: SELECT c1.id AS id1, c1.name AS color1, c2.id AS id2, c2.name AS color2 FROM color c1 JOIN color c2 ON c1.name = c2.name AND c1.id < c2.id; We were able to find duplicate color names because we self-joined records based on the color name. The second condition is used to skip identical records from both tables as well as the same pairs of records in reverse order. Take a look at the query result: id1 color1 id2 color2 1 blue 4 blue 3 yellow 5 yellow It is now easy to see that there are duplicate values for blue and yellow. Learn More If you would like to learn more about JOINs than we covered in this illustrated SQL self join guide, check out our SQL JOINs course. Remember, the best way to learn SQL JOINs – or anything else – is to learn by doing! Tags: JOIN SQL basics SQL clauses