4th Sep 2017 6 minutes read How to Track Down Duplicate Values in a Table Marija Ilic JOIN SQL basics SQL clauses Table of Contents Why SQL? SQL for finding duplicate value names Candidate uniqueness Complicated things can be solved by simple SQL queries Summary When it comes to information management, duplicates present one of the most common challenges to data quality. In this article, I'll explain how it is possible to find and distinguish duplicate names with the help of the SQL data programming language. I really like my maiden name. The reason I like it so much is because it's rare. My maiden name (first with last) provided a unique identifier on platforms such as LinkedIn, Facebook, Twitter and similar. But then I got married. In addition to all the happy life changes, my marriage brought me a new and much more common last name. When I typed my married name into the LinkedIn search engine, 138 profile accounts were returned! I began to ask myself: "How will recruiters relate my CV to my LinkedIn profile data when my name is not unique? Will they recognize and distinguish duplicate names in the data? And if yes, then how?" It happens that SQL is an excellent tool for detecting duplicates, so let's take a look at the SQL syntax that can be used to solve just such a problem as this. Why SQL? You might be wondering: "Why SQL? Why not just search the web manually via Google search or a similar engine?" Primarily because SQL is much faster and more efficient. It's a standard computer language designed specifically for storing, manipulating and retrieving data from databases. And you'll find that SQL's significant performance advantage is matched by the ease of learning the language, thanks to its simple & understandable syntax. SQL for finding duplicate value names To set up our examples, let's assume that the LinkedIn table, linkedin_profile_accounts, corresponding to candidate profile accounts from a recruiter's database looks like this: name lastname current_ job_position profile _views number_of _connections date_of _profile_changes e_mail Marija Ilic BI specialist 30 368 20.08.2017. ilicmar@gmail.com Marija Ilic Attorney at Law at Kolsek, Markovic & Ilic Law Office 23 168 10.01.2017. kolsek_law@KMI.be Marija Ilic Senior Finance Manager 45 452 03.07.2017. ilicsenior @consultancy.hr Marija Zoldin C++ developer 20 354 25.02.2017. zoldin@gmail.com Ivano Smith Full stack JavaScript developer 68 358 25.08.2017. smith123@gmail.com Tim Leep Digital Marketing Specialist 98 489 20.08.2017. timparty@yahoo.com Tim Leep Head of Sales at Berg d.o.o. 51 265 20.08.2016. timberg@yahoo.com We can see that two names in the table are duplicated: Marija Ilic (three occurrences) and Tim Leep (two occurrences). The following SQL query will identify those values: SELECT name,lastname,COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname HAVING COUNT(*)>1; Syntax is quite simple. Here is the explanation: SELECT ... FROM statement – we are retrieving data from a specific database table. Table name is linkedin_profile_accounts. name and lastname are chosen variables that will be used and viewed on the output/result set. COUNT(*) indicates that we will count occurrences. Here it is used in combination with a GROUP BY statement. We obtain the number of occurrences per specific group. In this case we count occurrences for each name/lastname combination. AS duplicate_count – AS defines the column name for the output result set. Three columns will be generated, and the third (which represents count per group) will be labeled duplicate_count. GROUP BY statement is used in combination with COUNT function. It groups the result-set by two columns – name and lastname. This means we are counting the number of occurrences per each name/lastname combination in the table HAVING COUNT(*)>1 – only duplicated records (names which are not unique) will appear in the output/result set. Output: name lastname duplicate_count Marija Ilic 3 Tim Leep 2 The result set identifies duplicate names Marija Ilic and Tim Leep from the database table. Be aware of the fact that HAVING COUNT(*)>1 statement excludes unique names from a result data set. If you remove HAVING COUNT(*)>1 statement then all the names will be included. Code without HAVING COUNT(*)>1: SELECT name,lastname,COUNT(*) as duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname; Output will look like this: name lastname duplicate_count Marija Ilic 3 Marija Zoldin 1 Ivano Smith 1 Tim Leep 2 All names are printed with their corresponding counts. Two names from the table are unique, and two have duplicate values. As we can see now, a recruiter will need additional information in order to uniquely identify a candidate and select the appropriate LinkedIn profile. Fortunately, the email address is also present in the profile account table, so perhaps this additional variable can be used to achieve a unique identifier for each name in the table. Candidate uniqueness Will the email address give us uniqueness? The following SQL will answer that question: SELECT name,lastname,e_mail, COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname,e_mail HAVING COUNT (*)>1; Output now will look like this: 0 rows selected No names are presented in the result set, because the name, lastname, and e_mail variables, together in combination, provide a unique identifier for every person in the linkedin_profile_account table (all return count=1). If we remove HAVING COUNT(*)>1 all observations will be presented in the result set, because each row/person in the linkedin_profile_accounts table has a different email account. The code and output in that case will look like this: SELECT name, lastname, e_mail, COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY name, lastname, e_mail; Output: name lastname e_mail duplicate_count Marija Ilic ilicmar@gmail.com 1 Marija Ilic kolsek_law@KMI.be 1 Marija Ilic ilicsenior@consultancy.hr 1 Marija Zoldin zoldin@gmail.com 1 Ivano Smith smith123@gmail.com 1 Tim Leep timparty@yahoo.com 1 Tim Leep timberg@yahoo.com 1 It is clear, then, that a recruiter must use three attributes: name, lastname, and e_mail, to pull data from the LinkedIn database table in order to update his/her knowledge about potential candidates. With this combination he/she will get a better profile picture, so that smarter decisions can be made. Complicated things can be solved by simple SQL queries Despite the ease and syntactical simplicity of detecting duplicate data through the use of SQL, I have found duplicate data to be at the root of most of the data quality issues I have encountered. Simple SQL queries like the ones presented here often proved to be the answer. In most cases, the simplest code proves to be the most valuable. That is the beauty of SQL: it solves complicated problems with simple queries. Summary Many excellent interactive SQL courses are available on LearnSQL for people with no IT background. If you would like to learn and benefit from the advantages of SQL, let me encourage you to go ahead and give it a try. Leave your fears behind, turn a new page in your life, and begin your SQL adventure. It's an exciting and valuable journey that you just don't want to miss! Tags: JOIN SQL basics SQL clauses