Removing Duplicate Records using SQL
January 25th, 2008 | by ndunn@webucator.com |Yesterday I was working with live data and did something very stupid that I was afraid was going to be a nightmare to fix, but thanks to Google, I found a quick solution.
I was writing a stored procedure in SQL Server that would make it easy to make an exact copy of course in our database. We often customize our courses. The obvious way to do this is to copy the course information and then modify it, but our intranet isn’t yet set up for that, hence the need for the stored procedure.
Well, part of the stored procedure involved making sure that the new course was in all the same categories as the original course. The relevant tables are below:

So, as part of the stored procedure, I wrote a SQL statement that looked like this:
INSERT INTO jnCrsCats SELECT @NewID, CategoryID, MainCategory FROM jnCrsCats WHERE CourseID = @OldID
Then, I decided to do a quick test of the SELECT to make sure it returned what I wanted. To do so, I needed to change @NewID to CourseID and then select the middle two lines and (in SQL Server Management Studion) press F5 to execute. Easy right? Stop reading for a second and look back at the query and see if you can guess what disaster lurked…
I accidentally highlighted the first three lines, which gave me this:
INSERT INTO jnCrsCats SELECT CourseID, CategoryID, MainCategory FROM jnCrsCats
And guess what. There was no constraint on the table that prevented me from adding duplicate records. There is now! I now have a composite primary key of CourseID and CategoryID, which will prevent this from happening again, but I couldn’t do that until I’d cleaned up the mess I had created. And this was a big mess. All of our courses were now being listed on our website in duplicate and we were getting a whole slew of application errors. I needed to fix this quickly, but couldn’t think of an easy way to do it. How was I going to uniquely identify the rows so that I could remove the duplicates? I googled and discovered quickly that I wasn’t the first person to run into this problem. Misery loves company, so that made me happier, but in itself wasn’t much help. I did some clicking and found this article, which lead me to a simple solution using a temporary table. Here’s what I did:
1. Create a temporary table from the rows in jnCrsCats grouping by all columns to remove duplicates:
SELECT CourseID, CategoryID, MainCategory INTO tempTable FROM jnCrsCats GROUP BY CourseID, CategoryID, MainCategory
2. Delete all rows from jnCrsCats:
DELETE FROM jnCrsCats
3. Insert all the ros from the temporary table into jnCrsCats:
INSERT INTO jnCrsCats SELECT CourseID, CategoryID, MainCategory FROM tempTable
Worked like a charm! Then I created a composite primary key on jnCrsCats so this wouldn’t happen again!
You must be logged in to post a comment.