Delete duplicate rows, do not preserve one row
- by Radley
I need a query that goes through each entry in a database, checks if a single value is duplicated elsewhere in the database, and if it is - deletes both entries (or all, if more than two).
Problem is the entries are URLs, up to 255 characters, with no way of identifying the row. Some existing answers on Stackoverflow do not work for me due to performance limitations, or they use uniqueid which obviously won't work when dealing with a string.
Long Version:
I have two databases containing URLs (and only URLs). One database has around 3,000 urls and the other around 1,000.
However, a large majority of the 1,000 urls were taken from the 3,000 url database. I need to merge the 1,000 into the 3,000 as new entries only.
For this, I made a third database with combined URLs from both tables, about 4,000 entries. I need to find all duplicate entries in this database and delete them (Both of them, without leaving either).
I have followed the query of a few examples on this site, but whenever I try to delete both entries it ends up deleting all the entries, or giving sql errors.
Alternatively:
I have two databases, each containing the separate database. I need to check each row from one database against the other to find any that aren't duplicates, and then add those to a third database.
Edit: I've got my own PHP solution which is pretty hacky, but works. I cannot answer my own question for 8 hours because I'm new, so here it is for now:
I went with a PHP script to accomplish this, as I'm more familiar with PHP than MySQL.
This generates a simple list of urls that only exist in the target
database, but not both. If you have more than 7,000 entries to parse
this may take awhile, and you will need to copy/paste the results
into a text file or expand the script to store them back into a
database.
I'm just doing it manually to save time.
Note: Uses MeekroDB
<pre>
<?php
require('meekrodb.2.1.class.php');
DB::$user = 'root';
DB::$password = '';
DB::$dbName = 'testdb';
$all = DB::query('SELECT * FROM old_urls LIMIT 7000');
foreach($all as $row) {
$test = DB::query('SELECT url FROM new_urls WHERE url=%s',
$row['url']);
if (!is_array($test)) {
echo $row['url'] . "\n";
}else{
if (count($test) == 0) {
echo $row['url'] . "\n";
}
}
}
?>
</pre>