Finding And Deleting Duplicates

If you are using MS Access, finding duplicates is easy. Just use the "Find duplicates" query wizard. If your database is in some other RDBMS, just use Access to create a database
with the same structure, and then use the "Find duplicates" query wizard to build your query, and then peek at the SQL that the wizard generates and copy and paste it into whatever RDBMS you are using.

"Hey, I didn't sign up to this mailing list to be told how to use a wizard!"

The thing about the "Find duplicates" wizard is that the SQL that it generates is very "general purpose". In most situations, you will have something that you can use as a primary key, such as an autonumber field, and the duplicated information you are looking for is in the other fields
(possibly in a candidate key).

Suppose I have a table of student details, imaginatively called "Student".
Table:  Student
Field # Name Type
Field1: StudentCard Text(7)
Field2: Surname Text (50)
Field3: GivenName Text (50)

Suppose furthermore, that there are some duplicate Student Card numbers. Sample data follows:

StudentCard Surname GivenName
TS00001 Blogs Jane
TS00002 Doe Joe
TS00003 Payne Thomas
TS00002 Doe Joe

This situation is a perfect job for the "Find duplicates" wizard, because there is nothing to distinguish rows 2 and 4 above. The SQL generated by the "Find duplicates" wizard groups on the fields specified in "look for duplicate information in", and then counts the number of rows in each group, and then displays the rows from those groups which have two or more rows.

Using the above sample data, you would get the following:

StudentCard Surname GivenName
TS00002 Doe Joe
TS00002 Doe Joe

The next question is: what are you going to do with that data? If there are only a handful of duplicates in the table, then it isn't such a chore to delete those duplicates manually. But what if you have a 1000 duplicates?

At the very least, you need some way of distinguishing the duplicate rows from each other. To do that, add an identity or autonumber field to the table.

Now the sample data looks like this:

StudentCard Surname GivenName StudentID
TS00001 Blogs Jane 1
TS00002 Doe Joe 2
TS00003 Payne Thomas 3
TS00002 Doe Joe 4

And the output from the "find duplicates" query wizard looks like:

StudentCard Surname GivenName StudentID
TS00002 Doe Joe 2
TS00002 Doe Joe 4

I can retrieve the same rows using the following query:

SELECT t1.*
FROM Student AS t1 INNER JOIN Student AS t2
ON t1.StudentCard=t2.StudentCard
WHERE t1.StudentID<>t2.StudentID

You might find it easier to understand what is going on by considering the following:

SELECT t1.*, t2.*
FROM Student AS t1 INNER JOIN Student AS t2
ON t1.StudentCard=t2.StudentCard

t1.* t2.*
TS00001 Bloggs Jane 1 TS00001 Bloggs Jane 1
TS00002 Doe Joe 2 TS00002 Doe Joe 4
TS00002 Doe Joe 2 TS00002 Doe Joe 2
TS00002 Doe Joe 4 TS00002 Doe Joe 4
TS00002 Doe Joe 4 TS00002 Doe Joe 1

Now back to the problem of what to *do* with those pesky duplicates...

A quick and dirty solution to the problem would be to assume that the first row with a given StudentCard number contains the correct data, and to delete any rows which have the same
StudentCard and have a higher StudentID.

I always like to write a SELECT query first, so that I know exactly which rows I'm about to delete:

SELECT t1.* FROM Student AS t1
WHERE t1.StudentID>(SELECT MIN(StudentID)
FROM Student AS t2 
WHERE t2.StudentCard=t1.StudentCard)

Just change "SELECT t1.* FROM" to "DELETE FROM" to turn that into a DELETE query.

Now I hate to rain on my own parade, but suppose there is a fifth record in the Student table:

StudentCard Surname GivenName StudentID
TS00001 Bloggs Jane 1
TS00002 Doe Joe 2
TS00003 Payne Thomas 3
TS00002 Doe Joe 4
TS00003 Pain Thomas 5

We want to delete the 4th record, but really, records 3 and 5 should be left alone to be manually processed later:

DELETE FROM Student AS t1
WHERE t1.StudentID>(SELECT MIN(StudentID)
FROM Student AS t2
WHERE t2.StudentCard=t1.StudentCard
AND t2.Surname=t1.Surname
AND t2.GivenName=t1.GivenName)

Home About the Author Copyright © 2001 Adelle Hartley