If you are using MS Access, finding duplicates is easy. Just use the "Find duplicates" query wizard. If your database is
in some other RDBMS, just use Access to create a database
with the same structure, and then use the "Find duplicates" query wizard to build your query, and then peek at the SQL
that the wizard generates and copy and paste it into whatever RDBMS you are using.
"Hey, I didn't sign up to this mailing list to be told how to use a wizard!"
The thing about the "Find duplicates" wizard is that the SQL that it generates is very "general purpose". In most
situations, you will have something that you can use as a primary key, such as an autonumber field, and the duplicated
information you are looking for is in the other fields
(possibly in a candidate key).
Suppose I have a table of student details, imaginatively called "Student".
| Table:
Student |
| Field
# |
Name |
Type |
| Field1: |
StudentCard |
Text(7) |
| Field2: |
Surname |
Text (50) |
| Field3: |
GivenName |
Text (50) |
Suppose furthermore, that there are some duplicate Student Card numbers. Sample data follows:
| StudentCard |
Surname |
GivenName |
| TS00001 |
Blogs |
Jane |
| TS00002 |
Doe |
Joe |
| TS00003 |
Payne |
Thomas |
| TS00002 |
Doe |
Joe |
This situation is a perfect job for the "Find duplicates" wizard, because there is nothing to distinguish rows 2 and
4 above. The SQL generated by the "Find duplicates" wizard groups on the fields specified in "look for duplicate
information in", and then counts the number of rows in each group, and then displays the rows from those groups which
have two or more rows.
Using the above sample data, you would get the following:
| StudentCard |
Surname |
GivenName |
| TS00002 |
Doe |
Joe |
| TS00002 |
Doe |
Joe |
The next question is: what are you going to do with that data? If there are only a handful of duplicates in the
table, then it isn't such a chore to delete those duplicates manually. But what if you have a 1000 duplicates?
At the very least, you need some way of distinguishing the duplicate rows from each other. To do that, add an identity
or autonumber field to the table.
Now the sample data looks like this:
| StudentCard |
Surname |
GivenName |
StudentID |
| TS00001 |
Blogs |
Jane |
1 |
| TS00002 |
Doe |
Joe |
2 |
| TS00003 |
Payne |
Thomas |
3 |
| TS00002 |
Doe |
Joe |
4 |
And the output from the "find duplicates" query wizard looks like:
| StudentCard |
Surname |
GivenName |
StudentID |
| TS00002 |
Doe |
Joe |
2 |
| TS00002 |
Doe |
Joe |
4 |
I can retrieve the same rows using the following query:
SELECT t1.*
FROM Student AS t1 INNER JOIN Student AS t2
ON t1.StudentCard=t2.StudentCard
WHERE t1.StudentID<>t2.StudentID
You might find it easier to understand what is going on by considering the following:
SELECT t1.*, t2.*
FROM Student AS t1 INNER JOIN Student AS t2
ON t1.StudentCard=t2.StudentCard
| t1.* |
t2.* |
| TS00001 |
Bloggs |
Jane |
1 |
TS00001 |
Bloggs |
Jane |
1 |
| TS00002 |
Doe |
Joe |
2 |
TS00002 |
Doe |
Joe |
4 |
| TS00002 |
Doe |
Joe |
2 |
TS00002 |
Doe |
Joe |
2 |
| TS00002 |
Doe |
Joe |
4 |
TS00002 |
Doe |
Joe |
4 |
| TS00002 |
Doe |
Joe |
4 |
TS00002 |
Doe |
Joe |
1 |
Now back to the problem of what to *do* with those pesky duplicates...
A quick and dirty solution to the problem would be to assume that the first row with a given StudentCard number contains
the correct data, and to delete any rows which have the same
StudentCard and have a higher StudentID.
I always like to write a SELECT query first, so that I know exactly which rows I'm about to delete:
SELECT t1.* FROM Student AS t1
WHERE t1.StudentID>(SELECT MIN(StudentID)
FROM Student AS t2
WHERE t2.StudentCard=t1.StudentCard)
Just change "SELECT t1.* FROM" to "DELETE FROM" to turn that into a DELETE query.
Now I hate to rain on my own parade, but suppose there is a fifth record in the Student table:
StudentCard Surname GivenName StudentID
TS00001 Bloggs Jane 1
TS00002 Doe Joe 2
TS00003 Payne Thomas 3
TS00002 Doe Joe 4
TS00003 Pain Thomas 5
We want to delete the 4th record, but really, records 3 and 5 should be left alone to be manually processed later:
DELETE FROM Student AS t1
WHERE t1.StudentID>(SELECT MIN(StudentID)
FROM Student AS t2
WHERE t2.StudentCard=t1.StudentCard
AND t2.Surname=t1.Surname
AND t2.GivenName=t1.GivenName)
|