string categorization strategies
- by Andrew Heath
I'm the one-man dev team on a fledgling military history website. One aspect of the site is a catalog of ~1,200 individual battles, including the nations & formations (regiments, divisions, etc) which took part.
The formation information (as well as the other battle info) was manually imported from a series of books by a 10-man volunteer team. The formations were listed in groups with varying formatting and abbreviation patterns. At the time I set up the data collection forms I couldn't think of a good way to process that data... and elected to store it all as strings in the MySQL database and sort it out later.
Well, "later" - as it tends to happen - has arrived. :-)
Each battle has 2+ records in the database - one for each nation that participated. Each record has a formations text string listing the formations present as the volunteer chose to add them.
Some real examples:
39th Grenadier Rgmt, 26th Volksgrenadier Division
2nd Luftwaffe Field Division, 246th Infantry Division
247th Rifle Division, 255th Tank Brigade
2nd Luftwaffe Field Division, SS Cavalry Division
28th Tank Brigade, 158th Rifle Division, 135th Rifle Division, 81st Tank Brigade, 242nd Tank Brigade
78th Infantry Division
3rd Kure Special Naval Landing Force, Tulagi Seaplane Base personnel
1st Battalion 505th Infantry Regiment
The ultimate goal is for each individual force to have an ID, so that its participation can be traced throughout the battle database. Formation hierarchy, such as the final item above 1st Battalion (of the) 505th Infantry Regiment also needs to be preserved. In that case, 1st Battalion and 505th Infantry Regiment would be split, but 1st Battalion would be flagged as belonging to the 505th.
In database terms, I think I want to pull the formation field out of the current battle info table and create three new tables:
FORMATION
[id] [name]
FORMATION_HIERARCHY
[id] [parent] [child]
FORMATION_BATTLE
[f_id] [battle_id]
It's simple to explain, but complicated to enact.
What I'm looking for from the SO community is just some tips on how best to tackle this problem. Ideally there's some sort of method to solving this that I'm not aware of. However, as a last resort, I could always code a classification framework and call my volunteers back to sort through 2,500+ records...