Hard to append a table with many records into another without generating duplicates
- by Bill Mudry
I may seem a bit wordy at first, but my hope is that it will be easier for all of you to understand what I am doing in the first place. I have an uncommon but enjoyable activity of collecting as many species of wood from around the world as I can (over 2,900 so far). OK, that is the real world.
Meanwhile, I have spent over 8 years compiling over 5.8 MB of text data on all the woods of the world. That got so large that learning some basic PHP and MySQL was most welcome, so I could build a new database-driven home for all this research. I am still slow at it but getting there. The original premise was to find evidence of as many species of wood in the world as I can. The more names identified, the more successful the project.
I have named the project TAXA for ease of conversation (short for Taxonomy). You are most
welcome to take a look at what I have so far at
www.prowebcanada.com/taxa.
It is 95% dynamically driven. So far I am reporting about 6,500 botanical wood names and, as said above, the more I can report, the more successful is the project.
I have a file of all the woods in the second largest wood collection in the world, the Tervuren wood collection in Belgium, with over 11,300 wood names even after cleaning out all duplicates. That is almost twice the number I am reporting now, so porting all the new wood names from Tervuren into the 'species' table, where I keep the reported data, would be a major and desirable advancement for the project.
At one point I was able to add all the Tervuren records to the species table, but over 3,000 duplicates also formed. They were not in the Tervuren file in the first place; they represent the same wood names common to both files. It is common sense that there would be woods common to both that, when merged, would create new duplicates.
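To make the goal concrete: what I believe I need is what some have called an "anti-join" insert, that is, append only those Tervuren names that do not already exist in species. Here is a tiny sketch of the idea, written with SQLite in Python only so I could test the logic on small data (my real tables are MySQL; the species, tervuren_target, and species_name names are mine, but the sample rows and the extra family column here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy versions of my real tables: 'species' (the master, which has many
# more columns in reality) and 'tervuren_target' (my working copy of the
# Tervuren list). Column names are the real ones; the data is invented.
cur.execute("CREATE TABLE species (species_name TEXT, family TEXT)")
cur.execute("CREATE TABLE tervuren_target (species_name TEXT)")

cur.executemany("INSERT INTO species VALUES (?, ?)",
                [("Quercus alba", "Fagaceae"),
                 ("Tectona grandis", "Lamiaceae")])
cur.executemany("INSERT INTO tervuren_target VALUES (?)",
                [("Quercus alba",),            # already in species: skip
                 ("Swietenia macrophylla",)])  # new name: append

# The anti-join insert: append only names not already in the master table.
cur.execute("""
    INSERT INTO species (species_name)
    SELECT t.species_name
    FROM tervuren_target AS t
    WHERE NOT EXISTS (SELECT 1 FROM species AS s
                      WHERE s.species_name = t.species_name)
""")
con.commit()

names = sorted(r[0] for r in cur.execute("SELECT species_name FROM species"))
print(names)  # Quercus alba stays once; only Swietenia macrophylla is added
```

The same WHERE NOT EXISTS shape should, as far as I understand, translate straight to MySQL, and it never deletes or rewrites anything already in species, it only skips incoming rows.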
At one point, with the help of others from another forum, I may very well have finally got the proper SQL statement. When I ran it, though, the system said (semi-amusingly at first) that the MySQL server had "gone away"! After looking up on the Net what could have caused this, one likely reason is that a MySQL timeout lapses, probably because of the large size of the files I am running. I am running this on a rented account at GoDaddy, so I cannot go about adjusting any config file.
For safety, I copied the tervuren.sql file as tervuren_target.sql and the species.sql file as species_master.sql to use as working files, just to make sure I protect the original files from destruction or damage. Later I can rename species_master back to just species.sql once I am happy all worked well.
The species table has about 18 columns, but only 5 of them match columns in the Tervuren file (name for name, and collation also). The rest of the columns are just along for the ride, so to speak. The common key in both tables is the 'species_name' column. I am not sure it is at all proper to call one a primary key and the other a foreign key, since there really is no relational connection between them. One is just more data for the other, and the Tervuren table can disappear afterward, never to be referred to by the working code in the application.
I have been very surprised, flabbergasted even, at how hard it can be to append records from one large table into another (with the same column names plus others) without generating NEW duplicates in the first place. Watch out for thinking that a SELECT DISTINCT statement may do the job: absolutely NO records in the species table must get destroyed in the process, and there is no way (well, that I know of) to tell the DISTINCT clause this. Yes, the original 'species' table has duplicates in it even before all this but, trust me, those have to be removed the long hard way, manually, record by record, or I will lose precious information. The more important thing is just to make sure no NEW duplicates form when bringing new names from tervuren_target.species_name into species.species_name.
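In case it saves anyone a wrong turn: as far as I can tell, DISTINCT only removes duplicates within the rows the SELECT itself returns; it never compares them against what is already sitting in the target table. A tiny SQLite-in-Python demo of that distinction (names made up; my real tables are MySQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE species (species_name TEXT)")
cur.execute("CREATE TABLE incoming (species_name TEXT)")

# 'Quercus alba' is already in the master table...
cur.execute("INSERT INTO species VALUES ('Quercus alba')")
# ...and also appears twice in the incoming list.
cur.executemany("INSERT INTO incoming VALUES (?)",
                [("Quercus alba",), ("Quercus alba",), ("Ebenaceae sp.",)])

# DISTINCT collapses the two incoming copies into one, but it knows
# nothing about the copy already in species, so a duplicate still forms.
cur.execute("INSERT INTO species SELECT DISTINCT species_name FROM incoming")
con.commit()

rows = [r[0] for r in cur.execute(
    "SELECT species_name FROM species ORDER BY species_name")]
print(rows)  # 'Quercus alba' now appears twice in species
```

So DISTINCT on its own cannot prevent the new duplicates I am worried about, which is why something like a NOT EXISTS check against the target seems necessary.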
I am hoping and thinking that a straight SQL solution should work, except for that nasty timeout. How do I get past that? Could it mean that I may have to turn to a PHP plus SQL method? Or would I have to break up the Tervuren file into a few smaller ones and run them independently (I hope not)?
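If it does come to the PHP-plus-SQL route, my understanding is that the usual workaround on a shared host is to run the same insert in small keyed batches, committing as you go, so no single statement runs long enough to be killed. A rough sketch of that loop, again in Python with SQLite purely for testing (my real code would be PHP against MySQL, and the batch size here is just illustrative):

```python
import sqlite3

BATCH = 2  # tiny for the demo; something like 500-1000 on real data

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE species (species_name TEXT)")
cur.execute("CREATE TABLE tervuren_target (species_name TEXT)")
cur.execute("INSERT INTO species VALUES ('Quercus alba')")
cur.executemany("INSERT INTO tervuren_target VALUES (?)",
                [(n,) for n in ["Acer rubrum", "Ebenaceae sp.",
                                "Quercus alba", "Swietenia macrophylla"]])

last = ""  # keyed cursor: resume after the last name already processed
while True:
    batch = [r[0] for r in cur.execute(
        "SELECT species_name FROM tervuren_target "
        "WHERE species_name > ? ORDER BY species_name LIMIT ?",
        (last, BATCH))]
    if not batch:
        break
    # Same "append only if not already present" check, one small batch
    # at a time instead of one huge statement.
    cur.executemany(
        "INSERT INTO species (species_name) "
        "SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM species WHERE species_name = ?)",
        [(n, n) for n in batch])
    con.commit()  # each batch commits, so a timeout cannot undo it all
    last = batch[-1]

names = sorted(r[0] for r in cur.execute("SELECT species_name FROM species"))
print(names)  # 'Quercus alba' still appears only once
```

Walking the table by species_name (rather than LIMIT/OFFSET) also means the loop can be restarted from the last committed name if the connection does drop partway through.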
So far, what seems like it should be easy has proven to be unexpectedly tricky. I appreciate any help you can give, but please start from the assumption that this may be harder to do right than it seems on the surface.
By the way, I am running a quad-core 64-bit system with Windows 7, so at least I have some fairly hefty power on the client end. I have a direct Ethernet cable feeding a cable connection to the Internet.
Once I get an algorithm and code working for this, I also have many other lists to process that could make the 'species' table grow even more. It could be the equivalent of (ahem) lighting a rocket under my project (especially compared to doing this record by record manually)!
This is my first time in this forum, so I do not know how I will receive any replies. Do I have to come back here periodically, or are replies emailed out as well? It would be great if you CC'd copies to me at billmudry at rogers.com :-)
Much thanks for your patience and help,
Bill Mudry
Mississauga, Ontario Canada (next to Toronto).