Navicat Blog

December 21, 2017 by Robert Gravelle

One of the DBA's biggest annoyances is dealing with duplicate data. No matter how much we try to guard against it, duplicates always mange to find their way into our tables. Duplicate data is a big problem because it can affect application views (where each item is supposed to be unique), skew statistics, and, in severe cases, increase server overhead.

In this tip, we'll learn how to recognize duplicate data in MySQL, as well as how to delete them without removing precious valid data.

Duplicate Types

Most of the duplicate records that you'll encounter are one of two distinct types: Duplicate Meaning and Non-unique Keys. In this instalment we'll be dealing with Duplicate Meaning; we'll address Non-unique Keys in the next one.

When a Duplicate is not a Duplicate

Duplicate Meaning is the most common type of duplicate. It's a situation where two or more fields' contents are not the same, but their meaning is. You could think of it as a semantic duplicate.

Consider the following table excerpt:

movie_name media
---------------------------
ACADEMY DINOSAUR Theatre
ACE GOLDFINGER Television
ADAPTATION HOLES Theatre
AFFAIR PREJUDICE Theatre
AFRICAN EGG TV

In the media column, the entries "Television" and "TV" have the same connotation, but expressed differently. This issue is often caused by the use of free-text input where a limited dropdown would have been a better choice.

This type of duplication can be very challenging to deal with because you can't exclude duplicates using a SELECT DISTINCT.

There are two ways to deal with this problem:

  • Select data using REPLACE() to swap out values that we don't want with those that we want to see instead:
  • SELECT DISTINCT movie_name,
    REPLACE(media, "TV", "TELEVISION") as media,
    FROM films;

  • Update the actual table data. Here's a statement that updates all instances of “TV” with the preferred “TELEVISION” value:
  • UPDATE films
    SET media = REPLACE(media, "TV", "TELEVISION")
    WHERE media = "TV";

Here's a real-life example that I came across only a month ago!

Somehow, some unwanted curly apostrophes found their way into our data. Notice the “O'BRIEN” and “O'BRIEN” entries:

first_name last_name
---------------------
PENELOPE GUINESS
CONAN O'BRIEN
ED CHASE
JENNIFER DAVIS
CONAN O'BRIEN

We can deal with this problem in the same way we did above:

  • Select data using REPLACE() to swap out curly apostrophes with regular single quotes so that we're always dealing with the same character:
  • SELECT DISTINCT first_name,
    REPLACE(last_name, "'", "'") as last_name,
    FROM actors
    WHERE REPLACE(last_name, "'", "'") like "O'BRIEN";

  • Update the actual table data. This statement updates all apostrophes in the last_name column with regular single quotes:
  • UPDATE actors
    SET last_name = REPLACE(last_name, "'", "'")
    WHERE last_name like "%'%";

Conclusion

Duplicate records, doubles, redundant data, duplicate rows; whatever you want to call them, they are one of the biggest banes in a DBA's life. Nevertheless, it's crucial that you weed them out on a regular basis, lest you want to generate faulty statistics and confuse your users who interact with the database.

Navicat Blogs
Feed Entries
Blog Archives
Share