|
I used to advise people not to use subqueries if they can help it.
You may recall that in my last tip, I posted the following query for finding duplicate Student Card
numbers:
SELECT t1.* FROM Student AS t1
WHERE t1.StudentID>(SELECT MIN(StudentID)
FROM Student AS t2
WHERE t2.StudentCard=t1.StudentCard)
The same results are returned by the following query, which does not use a subquery:
SELECT t1.StudentID, t1.StudentCard,
t1.Surname, t1.GivenName
FROM Student AS t1
LEFT JOIN Student AS t2
ON t1.StudentCard=t2.StudentCard
WHERE t1.StudentID>t2.StudentID
GROUP BY t1.StudentID, t1.StudentCard,
t1.Surname, t1.GivenName
So which is better?
If by "better" you mean easier to understand, I'll pick the subquery every time.
If by "better" you mean "faster", it's a tough call.
Without any indexes on this table other than the primary key, the first query gets bogged down with
only a small number of records.
I tried both queries with 1000 students. Using a subquery, my dodgy old computer took more than
30 seconds to find the duplicates. The second query executed almost instantly.
After adding an index to the duplicated field, it was a different story, with both queries running
neck and neck, until I tried them with 50 000 students.
At that point, the subquery was faster:
1 minute versus 1 minute and 15 seconds.
There are two lessons to be drawn from this little exercise:
1. Test any received wisdom against your own data.
You might find that *your* database is the exception that proves the rule. Which query is faster will
depend a lot on *your* data.
For instance, the test I described above, was done using a table where a high percentage of records
were duplicates. Would I have got the same results if there were only a handful of duplicates out of
the 50 000 records?
2. Subqueries aren't so bad after all... as long as you have the right indexes.
|