Navicat Blog

December 27, 2017 by Robert Gravelle

The majority of duplicate records fall into one of two categories: Duplicate Meaning and Non-unique Keys. The How to Spot and Delete Values with Duplicate Meaning in MySQL blog dealt with Duplicate Meaning; in today’s follow-up, we’ll address how to identify Non-unique Keys. That’s where two records in the same table have the same key, but may or may not have different values and meanings.

How Does this Happen?

Even a well-designed database can accumulate non-unique key duplicates. It often happens as a result of data that is imported from external sources such as text, csv, or excel files as well as data feeds. Even merging data from two different databases might create duplicate keys if you are combining each in some way to generate a new key – assuming of course that the new key column supports non-unique values. For example, concatenating two numbers to generate a new key could prove problematic:

Key 1     Key 2     New Key
--------------------------
10 25 1025
102 5 1025 !!!

An Example Table

In databases that support complex systems, it isn’t always feasible to prevent duplicate keys from occurring. What’s important is being able to deal with them quickly and effectively before they taint your data.

Let’s begin by separating the true duplicate values from overlapping keys.

Here’s the product of amalgamating two data sources of actors. You’ll notice that there a couple of duplicated names, specifically “JENNIFER DAVIS” and “NICK WAHLBERG”:

id      first_name      last_name
--------------------------------------
10 PENELOPE GUINESS
12 NICK WAHLBERG
14 ED CHASE
22 JENNIFER DAVIS
23 JOHNNY LOLLOBRIGIDA
27 BETTE NICHOLSON
34 GRACE MOSTEL
41 NICK WAHLBERG
39 JOE SWANK
23 CHRISTIAN GABLE
22 JENNIFER DAVIS

Nick Walberg would be an instance of Duplicate Meaning, which we explored in the last blog. JENNIFER DAVIS, on the other hand, appears in two records with the same key of 22. There is also a duplicated key that is associated with two unrelated actors: #23 for “JOHNNY LOLLOBRIGIDA” and “CHRISTIAN GABLE”. With regards to the duplicated keys of 22 and 23, the first is a true duplicate, whereas the second only needs a new key to be generated for one of the records.

Identifying and Counting Duplicates

The following query will identify all of the records of the above table that share a common id. I recommend using the MySQL group_concat() function to format duplicated rows together on one line:

SELECT
  COUNT(*) as repetitions,
  group_concat(id, ' (', last_name, ', ', first_name, ') ' SEPARATOR ' | ')
    as row_data
FROM amalgamated_actors
GROUP BY id
HAVING repetitions > 1;

Repetitions      row_data
-------------------------------------------------------------
2 22 (DAVIS, JENNIFER) | 22 (DAVIS, JENNIFER)
2 23 (LOLLOBRIGIDA, JOHNNY) | 23 (GABLE, CHRISTIAN)

If you ever wanted to find all duplicates - that is Duplicate Meaning and Non-unique Key duplicates - at the same time, you can combine the above query with one that checks for duplicated names using the UNION operator:

SELECT
  COUNT(*) as repetitions,
  group_concat(id, ' (', last_name, ', ', first_name, ') ' SEPARATOR ' | ')
    as row_data
FROM amalgamated_actors
GROUP BY id
HAVING repetitions > 1
UNION
SELECT
  COUNT(*) as repetitions,
  group_concat(id, ' (', last_name, ', ', first_name, ') ' SEPARATOR ' | ')
    as row_data
FROM amalgamated_actors
GROUP BY last_name, first_name
HAVING repetitions > 1;

That highlights all the duplicates in one result set:

Repetitions      row_data
-------------------------------------------------------------
2 22 (DAVIS, JENNIFER) | 22 (DAVIS, JENNIFER)
2 23 (LOLLOBRIGIDA, JOHNNY) | 23 (GABLE, CHRISTIAN)
2 41 (WAHLBERG, NICK) | 12 (WAHLBERG, NICK)

Conclusion

Crafting a query to identify duplicate keys in MySQL is relatively simple because you only need to group on the key field and include the “Having COUNT(*) > 1” clause. In a future article, we’ll review some different approaches for deleting duplicate rows and updating keys.

Navicat Blogs
Feed Entries
Share
Blog Archives