8th May 2018 5 minutes read Converting Subqueries to Joins Ignacio L. Bisso data analysis how to JOIN SQL basics subqueries Table of Contents When should I use SQL subqueries? The data Example: replacing a subquery with a JOIN Example: when subqueries are the only way to go Example: when JOINs and subqueries are equally efficient Try it yourself! Not all queries are alike, especially in terms of performance. In this article, we'll look at how you can convert SQL subqueries to joins for improved efficiency. When should I use SQL subqueries? Great question! Unfortunately, there's no concrete answer. SQL beginners tend to overuse subqueries. Typically, once they find that SQL construction works in one situation, they try to apply that same approach to other situations. It's only natural. However, in some cases a subquery can be replaced with a more efficient JOIN. If you can avoid a subquery and replace it with a JOIN clause, you should do so without hesitation. But of course, in some cases, using a subquery is the only way to solve a data question. In this article, I'll show examples of both cases: when a subquery is a must and when a subquery should be avoided and replaced by a JOIN. The data Before getting to the examples, let’s briefly look at the sample database we will use. The database has two tables representing the production statistics of a fictional apple farm company named EverRed. The company has three farms. The first table is current_year_production, which contains information about the number of apples produced in the current year by each farm, as well as the area and number of trees on each farm. The second table, production_history, stores past production information for each farm. Below are some sample data from these two tables. current_year_production farm_idarea_m2farm_namenumber_of_treesproduction_in_kg 10010000The Paradise2404400 10115000Evergreen3006200 10220000Red Delicious5809300 production_history farm_idyearproduction_in_kgprice_ton 100201741001200 101201758001200 102201794001200 100201639001300 101201664001300 102201691001300 Example: replacing a subquery with a JOIN Suppose you're an SQL data analyst working at EverRed. The owner of the company wants you to obtain the names of the farms where the company is producing more apples in the current year than in the previous year (2017). Solution using a subquery: SELECT farm_name, FROM current_year_production CYP WHERE production_in_kg > ( SELECT production_in_kg FROM production_history PH WHERE PH.farm_id = CYP.farm_id AND year = 2017 ) Solution using a JOIN clause: SELECT farm_name, FROM current_year_production CYP JOIN production_history PH ON PH.farm_id = CYP.farm_id WHERE PH.year = 2017 AND CYP.production_in_kg > PH.production_in_kg The difference between these two approaches is in performance. While the JOIN clause in the second example needs to be executed only once, the subquery in the first example will be executed once per farm. In this case, we only have three farms, so the difference is negligible. But what if you worked for a larger company that has 10,000 global farms? The subquery would need to be executed 10,000 times. Clearly, a subquery is inefficient for our purposes here. Moreover, in a test database with only a few farms, both queries execute with an acceptable response time; however, when we move to a productive database, (where the data volume is usually much higher), the response time of the subquery approach will increase significantly, while response time of JOIN approach will remain stable. The result of both previous equivalent queries is: farm_name The Paradise Evergreen Example: when subqueries are the only way to go Let’s now suppose the owner of the company, after reading the results you delivered in the previous query, asks you to obtain the names of the farms that are producing more apples per square meter this year than the historical average. This sounds complex, but it's easier than it seems. Solution using a subquery: SELECT farm_name, production_in_kg / area AS "production_per_meter" FROM Current_year_production WHERE production_in_kg / area > ( SELECT AVG(PH.production_in_kg / CYP.area) FROM production_history PH JOIN Current_year_production CYP ON PH.farm_id = CYP.farm_id ) We can't replace this subquery with a JOIN because we don’t have a table with the average previously calculated. In other words, we need to first calculate the historical average. And to do that, we need a GROUP BY, which can break the one-to-one relation needed for a JOIN. Another point to note is that the metric "apples per square meter" is obtained with the following expression: production_in_kg / area We used the "apples per square meter" metric because we need some way to compare the productivity of the different farms and rank them. The total "production_in_kg" of a farm is not a comparable value—because it's likely, for instance, that the biggest farm will have a better production_in_kg. So we divide "production_in_kg" by the area of each farm to standardize the values and create a comparable metric. We find that the historical average production per square meter is 0.42. Then the result of the previous query is: farm_nameproduction_per_meter The Paradise0.44 Red Delicious0.47 Example: when JOINs and subqueries are equally efficient As a last data question, let’s try to obtain the years when the company produced fewer apples than in the current year. We can write this query using two different approaches. Solution using a subquery: SELECT year, sum(production_in_kg) FROM production_history PH GROUP BY year HAVING sum(production_in_kg) < ( SELECT sum(production_in_kg) FROM current_year_production ) Solution using a JOIN clause: SELECT year, sum(PH.production_in_kg) FROM production_history PH JOIN current_year_production CYP ON PH.farm_id = CYP.farm_id GROUP BY year HAVING sum(PH.production_in_kg) < sum(CYP.production_in_kg) You can see that both of these queries are really similar; the main difference is in the JOIN clause and the subquery. In this case, both queries are equally efficient—the subquery is executed one time in the HAVING clause, so there is no performance issue. Try it yourself! To wrap things up, it's important to note that subqueries and JOINs are both really important resources for an SQL developer. We saw examples where we can replace a subquery with a JOIN and examples where we cannot do such a replacement. And sometimes, subqueries and JOINs are equally efficient. But how do you know when to use a subquery and when to use a JOIN? In all honesty, the only way to develop your intuition is to regularly solve SQL exercises. If you're looking to sharpen your SQL skills, our SQL Practice Set offers 88 practice problems for a comprehensive review. Stay tuned for more articles! Tags: data analysis how to JOIN SQL basics subqueries