Back to articles list Articles Cookbook
5 minutes read

How to Solve Capitalization Data Quality Issues

Misspelled names, typos, and text data quality issues in your database? Power up your queries! Use SQL string functions to address data quality issues related to capitalization.

Sometimes, our SQL queries don't work as expected because of data quality issues. In this article, we will examine some string-related SQL functions that can correct data quality issues related to capitalization. We'll be using PostgreSQL in our examples, but similar functions are available in most database engines.

Why Standardize Strings in Your Database?

Often, the data quality of strings is not the best, especially in fields that are populated or loaded as free text. There is frequently some level of irregularity in the data entry (such as mistaken capitalization), so it's a good idea standardize string values.

Let's start by exploring some basic string functions related to capitalization.

SQL String Functions to the Rescue

Let me paint you a picture of a doctor who stores all his patient data in a single table. The table contains the patient's full name, the date of the visit, the doctor's diagnosis, the suggested treatment, and any prescribed drugs.

Every time a patient goes to his office, the doctor creates a new record. He manually types his notes into the database instead of automating it. As a consequence, the data quality suffers.

Full_Name date diagnostic treatment Drugs_in_receipt
John Smith '03/04/2016′ Flu rest "aspirin, paracetamol"
Mary Deep '10/11/2016′ Food Poisoning Rest & diet "penicillin,paracetamol"
Agnes Jason '03/12/2016′ flu rest "aspirin, paracetamol"
Johnny SMITH '15/03/2017′ Food Poisoned Rest & diet "penicillin,paracetamol"
AGNES Jason '19/072017′ angina Rest and don't speak "amoxicillin"
Peter Duckerz '10/10/2017′ Flu Rest "paracetamol"

Same Patient, Different Name

Our doctor is not a good typist. He misspells words, uses uppercase letters randomly, and makes a few typos. He's particularly weak with remembering names, so the same patient could have more than one name in his database. When the doctor needs all the records related to one patient, he may retrieve the wrong ones. This is a serious problem, but we can fix it with SQL string functions.

Let's try the following query to analyze what's happening:

SELECT * FROM patient_data WHERE full_name = 'Agnes Jason'
Full_Name date diagnostic treatment Drugs_in_receipt
Agnes Jason '03/12/2016′ flu rest "aspirin, paracetamol"

This gives us one result for "Agnes Jason". However, if we look at the table, we can see there is another record for an "AGNES Jason". Since we want both records, we must change the query. By adding the LOWER function, which essentially converts a string to all-lowercase letters, we can eliminate the good doctor's inconsistent capitalization. Here is the new query:

SELECT * FROM patient_data WHERE lower(full_name) = 'agnes jason'
Full_Name date diagnostic treatment Drugs_in_receipt
Agnes Jason '03/12/2016′ flu rest "aspirin, paracetamol"
AGNES Jason '19/072017′ angina Rest and don't speak "amoxicillin"

The previous approach worked for Agnes' records, but it won't work for John Smith. That's because "John Smith" has a different name in another record: "Johnny Smith".

Here's the same query for John Smith:

SELECT * FROM patient_data WHERE lower(full_name) = 'john smith'
Full_Name date diagnostic treatment Drugs_in_receipt
John Smith '03/04/2016′ Flu rest "aspirin, paracetamol"

To solve this problem, we'll use the LIKE operator and a wildcard character (%). We'll put them in the WHERE clause. The idea is to obtain all records with a full_name ending in 'smith' (remember to use the lower function here).

Here's the new query:

SELECT * FROM patient_data WHERE lower(full_name) LIKE '%smith'
Full_Name date diagnostic treatment Drugs_in_receipt
John Smith '03/04/2016′ Flu rest "aspirin, paracetamol"
Johnny SMITH '15/03/2017′ Food Poisoned Rest & diet "penicillin,paracetamol"

If you're not familiar with SQL wildcards, have a look at the following table before we proceed:

Wildcard Meaning
% Matches any number of characters. Can be used at the beginning, middle, or end of a string.
_ Matches exactly one character. Can also be used anywhere in a string.

Basically, we used the % wildcard to tell SQL to find any "full_name" record that has "smith" at the end of it. It doesn't matter what (or how many) characters come before "smith".

Let's see another example of the LIKE operator at work. Say you want to obtain every record for a patient with the last name of "Duckerz" but you don't remember exactly how the name spelled. Is the last letter an "s" or a "z"? Let's use a wildcard in a query to solve this issue:

SELECT * FROM patient_data WHERE lower(full_name) LIKE '%ducker_'

We are using two wildcards: the % wildcard to match the patient's first name and the _ wildcard to match exactly one character after "Ducker". This is what we get:

Full_Name date diagnostic treatment Drugs_in_receipt
Peter Duckerz '10/10/2017′ Flu Rest "paracetamol"

The opposite of the SQL lower function is the upper function. In fact, we could've used the SQL upper function instead of lower in all the example queries we covered so far in this article. For example, if we replaced SQL lower by SQL upper in the first query we executed, we would have gotten the following query, and as you can tell, the results would've been exactly the same.

SELECT * FROM patient_data WHERE upper(full_name) = 'agnes jason'
Full_Name date diagnostic treatment Drugs_in_receipt
Agnes Jason '03/12/2016′ flu rest "aspirin, paracetamol"
AGNES Jason '19/072017′ angina Rest and don't speak "amoxicillin"

The role of the SQL initcap function is to improve readability by only capitalizing the initial letter of each word in a string (hence the name). This function receive a string argument (that can contain several words) and converts the first letter of every word to uppercase and the rest of the letters to lowercase. Let's see an example:

SELECT full_name,initcap(full_name) as Beauty_Name FROM patient_data
Full_Name Beauty_Name
John Smith John Smith
Mary Deep Mary Deep
Agnes Jason Agnes Jason
Johnny SMITH Johnny Smith
AGNES Jason Agnes Jason
Peter Duckerz Peter Duckerz

Learn More About SQL String Functions

We've covered the most frequently used SQL functions related to capitalization, such as SQL lower, SQL upper, and SQL initcap.

That's a good start, but there is a lot more to learn. Check out the LearnSQL.com Standard SQL Functions course to sharpen your skills. Try it for free today!