4th Sep 2017 6 minutes read

How to Track Down Duplicate Values in a Table

Table of Contents

Why SQL?
SQL for finding duplicate value names
Candidate uniqueness
Complicated things can be solved by simple SQL queries
Summary

When it comes to information management, duplicates present one of the most common challenges to data quality. In this article, I'll explain how it is possible to find and distinguish duplicate names with the help of the SQL data programming language.

I really like my maiden name. The reason I like it so much is because it's rare. My maiden name (first with last) provided a unique identifier on platforms such as LinkedIn, Facebook, Twitter and similar. But then I got married. In addition to all the happy life changes, my marriage brought me a new and much more common last name. When I typed my married name into the LinkedIn search engine, 138 profile accounts were returned! I began to ask myself: "How will recruiters relate my CV to my LinkedIn profile data when my name is not unique? Will they recognize and distinguish duplicate names in the data? And if yes, then how?" It happens that SQL is an excellent tool for detecting duplicates, so let's take a look at the SQL syntax that can be used to solve just such a problem as this.

Why SQL?

You might be wondering: "Why SQL? Why not just search the web manually via Google search or a similar engine?" Primarily because SQL is much faster and more efficient. It's a standard computer language designed specifically for storing, manipulating and retrieving data from databases. And you'll find that SQL's significant performance advantage is matched by the ease of learning the language, thanks to its simple & understandable syntax.

SQL for finding duplicate value names

To set up our examples, let's assume that the LinkedIn table, linkedin_profile_accounts, corresponding to candidate profile accounts from a recruiter's database looks like this:

name	lastname	current_ job_position	profile _views	number_of _connections	date_of _profile_changes	e_mail
Marija	Ilic	BI specialist	30	368	20.08.2017.	ilicmar@gmail.com
Marija	Ilic	Attorney at Law at Kolsek, Markovic & Ilic Law Office	23	168	10.01.2017.	kolsek_law@KMI.be
Marija	Ilic	Senior Finance Manager	45	452	03.07.2017.	ilicsenior @consultancy.hr
Marija	Zoldin	C++ developer	20	354	25.02.2017.	zoldin@gmail.com
Ivano	Smith	Full stack JavaScript developer	68	358	25.08.2017.	smith123@gmail.com
Tim	Leep	Digital Marketing Specialist	98	489	20.08.2017.	timparty@yahoo.com
Tim	Leep	Head of Sales at Berg d.o.o.	51	265	20.08.2016.	timberg@yahoo.com

We can see that two names in the table are duplicated: Marija Ilic (three occurrences) and Tim Leep (two occurrences).

The following SQL query will identify those values:

SELECT name,lastname,COUNT(*) AS duplicate_count
FROM linkedin_profile_accounts
GROUP BY name,lastname HAVING COUNT(*)>1;

Syntax is quite simple. Here is the explanation:

SELECT ... FROM statement – we are retrieving data from a specific database table. Table name is linkedin_profile_accounts.
name and lastname are chosen variables that will be used and viewed on the output/result set.
COUNT(*) indicates that we will count occurrences. Here it is used in combination with a GROUP BY statement. We obtain the number of occurrences per specific group. In this case we count occurrences for each name/lastname combination.
AS duplicate_count – AS defines the column name for the output result set. Three columns will be generated, and the third (which represents count per group) will be labeled duplicate_count.
GROUP BY statement is used in combination with COUNT function. It groups the result-set by two columns – name and lastname. This means we are counting the number of occurrences per each name/lastname combination in the table
HAVING COUNT(*)>1 – only duplicated records (names which are not unique) will appear in the output/result set.

Output:

name	lastname	duplicate_count
Marija	Ilic	3
Tim	Leep	2

The result set identifies duplicate names Marija Ilic and Tim Leep from the database table.
Be aware of the fact that HAVING COUNT(*)>1 statement excludes unique names from a result data set. If you remove HAVING COUNT(*)>1 statement then all the names will be included.

Code without HAVING COUNT(*)>1:

SELECT name,lastname,COUNT(*) as duplicate_count
FROM linkedin_profile_accounts
GROUP BY name,lastname;

Output will look like this:

name	lastname	duplicate_count
Marija	Ilic	3
Marija	Zoldin	1
Ivano	Smith	1
Tim	Leep	2

All names are printed with their corresponding counts. Two names from the table are unique, and two have duplicate values. As we can see now, a recruiter will need additional information in order to uniquely identify a candidate and select the appropriate LinkedIn profile. Fortunately, the email address is also present in the profile account table, so perhaps this additional variable can be used to achieve a unique identifier for each name in the table.

Candidate uniqueness

Will the email address give us uniqueness? The following SQL will answer that question:

SELECT name,lastname,e_mail, COUNT(*) AS duplicate_count
FROM linkedin_profile_accounts
GROUP BY name,lastname,e_mail HAVING COUNT (*)>1;

Output now will look like this:

0 rows selected

No names are presented in the result set, because the name, lastname, and e_mail variables, together in combination, provide a unique identifier for every person in the linkedin_profile_account table (all return count=1).

If we remove HAVING COUNT(*)>1 all observations will be presented in the result set, because each row/person in the linkedin_profile_accounts table has a different email account.

The code and output in that case will look like this:

SELECT
	name,
	lastname,
	e_mail,
	COUNT(*) AS duplicate_count
FROM linkedin_profile_accounts
GROUP BY name, lastname, e_mail;

Output:

name	lastname	e_mail	duplicate_count
Marija	Ilic	ilicmar@gmail.com	1
Marija	Ilic	kolsek_law@KMI.be	1
Marija	Ilic	ilicsenior@consultancy.hr	1
Marija	Zoldin	zoldin@gmail.com	1
Ivano	Smith	smith123@gmail.com	1
Tim	Leep	timparty@yahoo.com	1
Tim	Leep	timberg@yahoo.com	1

It is clear, then, that a recruiter must use three attributes: name, lastname, and e_mail, to pull data from the LinkedIn database table in order to update his/her knowledge about potential candidates. With this combination he/she will get a better profile picture, so that smarter decisions can be made.

Complicated things can be solved by simple SQL queries

Despite the ease and syntactical simplicity of detecting duplicate data through the use of SQL, I have found duplicate data to be at the root of most of the data quality issues I have encountered. Simple SQL queries like the ones presented here often proved to be the answer. In most cases, the simplest code proves to be the most valuable. That is the beauty of SQL: it solves complicated problems with simple queries.

Summary

Many excellent interactive SQL courses are available on LearnSQL for people with no IT background. If you would like to learn and benefit from the advantages of SQL, let me encourage you to go ahead and give it a try. Leave your fears behind, turn a new page in your life, and begin your SQL adventure. It's an exciting and valuable journey that you just don't want to miss!

Tags:

Why SQL?

SQL for finding duplicate value names

Candidate uniqueness

Complicated things can be solved by simple SQL queries

Summary

You may also like