Using Markov models to convert all caps to mixed case and related problems
- by hippietrail
I've been thinking about using Markov techniques to restore missing information to natural language text:
Restore mixed case to text in all caps
Restore accents / diacritics to text in languages that should have them, where the text has been converted to plain ASCII
Convert rough phonetic transcriptions back into native alphabets
These seem to be listed in order from least difficult to most difficult. In each case the underlying problem is resolving ambiguities based on context.
I can use Wiktionary as a dictionary to find the candidate forms of each word, and Wikipedia as a corpus from which to build n-gram statistics and Markov chains that resolve the ambiguities.
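A minimal sketch of how this could work for the all-caps case, assuming a bigram (first-order Markov) model with add-one smoothing and a Viterbi-style search over the candidate casings of each word. The three-sentence CORPUS is a made-up stand-in for real Wikipedia text, and the names train, log_prob and restore_case are hypothetical, not from any existing library.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a real corpus such as Wikipedia text (assumption).
CORPUS = [
    "George lost his SIM card in the bush",
    "the sim was fun but George lost",
    "his SIM card was in the car",
]

def train(sentences):
    """Count unigrams and bigrams, and record every casing seen for each word."""
    unigrams, bigrams = Counter(), Counter()
    variants = defaultdict(set)  # lowercased word -> set of observed casings
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        for tok in tokens[1:]:
            variants[tok.lower()].add(tok)
    return unigrams, bigrams, variants

def log_prob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(cur | prev)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))

def restore_case(text, unigrams, bigrams, variants):
    """Pick the most probable casing sequence under the bigram model."""
    vocab_size = len(unigrams)
    # best[candidate] = (log score, path) of the best sequence ending in candidate
    best = {"<s>": (0.0, [])}
    for word in text.split():
        key = word.lower()
        # Unknown words fall back to plain lowercase.
        candidates = variants.get(key, {key})
        new_best = {}
        for cand in candidates:
            score, path = max(
                ((s + log_prob(prev, cand, unigrams, bigrams, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return " ".join(max(best.values(), key=lambda item: item[0])[1])

unigrams, bigrams, variants = train(CORPUS)
print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", unigrams, bigrams, variants))
```

With this toy corpus, "SIM" wins over "sim" before "card" because the bigrams "his SIM" and "SIM card" were observed, while "sim" wins in "THE SIM WAS FUN". The sketch does not capitalise the first word of a sentence; that would need a separate rule or a corpus that preserves sentence-initial capitals.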
Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?
Examples
GEORGE LOST HIS SIM CARD IN THE BUSH - George lost his SIM card in the bush
tantot il rit a gorge deployee - tantôt il rit à gorge déployée
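Both examples share the same first step: map each degraded word to the set of full forms it could have come from. One possible way to build that lookup table, assuming a word list is available (WORDS below is a tiny hypothetical stand-in for one extracted from Wiktionary, and strip_diacritics and build_candidates are illustrative names):

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_candidates(word_list):
    """Map each degraded key (lowercase, no diacritics) to its possible full forms."""
    table = defaultdict(set)
    for word in word_list:
        table[strip_diacritics(word).lower()].add(word)
    return table

# Hypothetical mini word list; a real one would come from a dictionary dump.
WORDS = ["tantôt", "il", "rit", "a", "à", "gorge", "déployée", "SIM", "sim", "George"]

table = build_candidates(WORDS)
for token in "tantot il rit a gorge deployee".split():
    print(token, sorted(table[token]))
```

Only keys with more than one candidate, such as "a" (a / à) or "sim" (sim / SIM), need the context model to choose between them. Note that letters such as "ø" or "œ" have no base-plus-combining-mark decomposition, so they would need an explicit mapping in addition to the NFD step.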