Hard to append a table with many records into another without generating duplicates

Posted by Bill Mudry on Stack Overflow See other posts from Stack Overflow or by Bill Mudry
Published on 2010-06-12T05:15:18Z Indexed on 2010/06/12 5:22 UTC
Read the original article Hit count: 336

Filed under:

I may seem to be a bit wordy at first but for the hope it will be easier for all of you to understand what I am doing in the first place. I have an uncommon but enjoyable activity of collecting as many species of wood from around the world as I can (over 2,900 so far). Ok, that is the real world.

Meanwhile I have spent over 8 years compiling over 5.8 meg of text data on all the woods of the world. That got so large that learning some basic PHP and MySQL was most welcome so I could build a new database driven home for all this research. I am still slow at it but getting there. The original premise was to find evidence of as many species of woods in the world I can. The more names identified, the more successful the project.

I have named the project TAXA for ease of conversation (short for Taxonomy). You are most welcome to take a look at what I have so far at www.prowebcanada.com/taxa. It is 95% dynamically driven. So far I am reporting about 6,500 botanical wood names and, as said above, the more I can report, the more successful is the project.

I have a file of all the woods in the second largest wood collection in the world, the Tervuren wood collection in the Netherlands with over 11,300 wood names even after cleaning out all duplicates. That is almost twice the number I am reporting now so porting all the new wood names from Tervuren to the 'species' table where I keep the reported data would be a major desirable advancement in the project.

At one point I was able to add all the Tervuren records to the species table but over 3,000 duplicates also formed. They were not in the Tervuren file in the first place but represent the same wood names common to both files. It is common sense that there would be woods common to both that when merged would create new duplicates.

At one point and with the help of others from another forum, I may very well have finally got the proper SQL statement. When I ran it, though, the system said (semi-amusingly at first) ----- that it had gone away! After looking up on the Net what could have have done this, one reason is that the MySQL timeout lapses and probably because of the large size of files I am running. I am running this on a rented account on Godaddy so I cannot go about trying to adjust any config file.

For safety, I copied the tervuren.sql file as tervuren_target.sql and the species.sql file as species_master.sql tp use as working files just to make sure I protect the original files from destruction or damage. Later I can name the species_master back to just species.sql once I am happy all worked well.

The species file has about 18 columns in it but only 5 columns match the columns in the Tervuren file (name for name and collation also). The rest of the columns are just along for the ride, so to speak. The common key in both is the 'species_name" columns in both. I am not sure it is at all proper to call one a primary key and the other a foreign key since there really is no relational connection to them. One is just more data for the other and can disappear after, never to be referred to the working code in the application.

I have been very surprised and flabbergasted on how hard it can be to append records from one large table into another (with same column names plus others) without generating NEW duplicates in the first place. Watch out thinking that a SELECT DISTINCT statement may do the job because absolutely NO records in the species table must get destroyed in the process and there is no way (well, that I know of) to tell the 'DISTINCT" command this. Yes, the original 'species' table has duplicates in it even before all this but, trust me ---- they have to be removed the long hard way manually record by record or I will lose precious information. It is more important to just make sure no NEW duplicates form through bringing in new names in the tervuren_target.species_name into species.species_name.

I am hoping and thinking that a straight SQL solution should work --- except for that nasty timeout. How do I get past that? Could it mean that I may have to turn to a PHP plus SQL method?? Or ..... would I have to break up the Tervuren files into a few smaller ones and run them independently (hope not....)"

So far, what seems should be easy has proven to be unexpectedly tricky. I appreciate any help you can give but start from the assumption that this may be harder to do right than it may seem on the surface.

By the way --- I am running a quad 64 bit system with Windows 7, so at least I have some fairly hefty power on the client end. I have a direct ethernet cable feeding a cable connection to the Internet.

Once I get an algorithm and code working for this, I also have many other lists to process that could make the 'species' table grow even more. It could be equivalent to (ahem) lighting a rocket under my project (especially compared to do this record by record manually)!

This is my first time in this forum, so I do not know how I can receive any replies. Do I have to to come back here periodically or are replies emailed out also? It would be great if you CC'd copies to me at billmudry at rogers.com :-)

Much thanks for your patience and help,

Bill Mudry Mississauga, Ontario Canada (next to Toronto).

© Stack Overflow or respective owner

Related posts about mysql