Using Markov models to convert all caps to mixed case and related problems
Posted by hippietrail on Stack Overflow
Published on 2010-12-21T02:15:16Z
Indexed on 2010/12/21 14:54 UTC
I've been thinking about using Markov techniques to restore missing information to natural language text, for example:
- Restore mixed case to text in all caps
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII
- Convert rough phonetic transcriptions back into native alphabets
Those seem to be ordered from least to most difficult. In each case the core problem is resolving ambiguities based on context.
I could use Wiktionary as a dictionary and Wikipedia as a corpus, then use n-grams and Markov chains to resolve the ambiguities.
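To make the idea concrete, here is a minimal sketch of case restoration with a bigram (first-order Markov) model and a Viterbi search over candidate spellings. It is only an illustration under stated assumptions: the tiny `CORPUS` stands in for Wikipedia text, the names `train` and `restore` are hypothetical, smoothing is plain add-one, and sentence-initial capitalisation in a real corpus is ignored.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a real corpus such as Wikipedia text (illustrative data only).
CORPUS = [
    "George lost his SIM card in the bush",
    "George Bush lost the election",
    "the SIM card is in his phone",
    "he lost his card in the bush",
]

def train(sentences, key=str.lower):
    """Count bigrams and map each normalised key to the surface forms seen for it."""
    bigrams = Counter()        # (previous form, current form) -> count
    context = Counter()        # previous form -> number of times it starts a bigram
    forms = defaultdict(set)   # normalised key -> observed surface forms
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            context[prev] += 1
            forms[key(cur)].add(cur)
    vocab_size = sum(len(v) for v in forms.values())
    return bigrams, context, forms, vocab_size

def restore(text, model, key=str.lower):
    """Pick the most probable sequence of surface forms with a Viterbi search."""
    bigrams, context, forms, vocab_size = model

    def logp(prev, cur):
        # Add-one smoothed estimate of P(cur | prev).
        return math.log((bigrams[(prev, cur)] + 1) / (context[prev] + vocab_size))

    # best maps the last chosen form to (log probability, path so far).
    best = {"<s>": (0.0, [])}
    for word in text.split():
        # Unknown words fall back to their normalised form.
        candidates = sorted(forms.get(key(word)) or {key(word)})
        new_best = {}
        for cand in candidates:
            score, path = max(
                ((s + logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return " ".join(max(best.values(), key=lambda item: item[0])[1])

model = train(CORPUS)
print(restore("GEORGE LOST HIS SIM CARD IN THE BUSH", model))
# George lost his SIM card in the bush
print(restore("GEORGE BUSH LOST", model))
# George Bush lost
```

The `key` parameter is the only case-specific part: swapping in a function that strips accents instead of lowercasing would, in principle, turn the same search into a diacritic restorer.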
Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?
Examples
- GEORGE LOST HIS SIM CARD IN THE BUSH -> George lost his SIM card in the bush
- tantot il rit a gorge deployee -> tantôt il rit à gorge déployée
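For the accent example, the first step would be finding which ASCII words are actually ambiguous. The sketch below folds accented forms to plain keys with Python's standard `unicodedata` module and lists the candidates for each token. The `WORDS` list is a hypothetical stand-in for a Wiktionary-derived dictionary, and `strip_accents`, `build_index`, and `candidates` are illustrative names.

```python
import unicodedata
from collections import defaultdict

# Hypothetical stand-in for a dictionary extracted from Wiktionary.
WORDS = ["tantôt", "il", "rit", "a", "à", "gorge", "déployée", "déployé"]

def strip_accents(word):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(words):
    """Map each accent-free, lowercased key to every dictionary form that folds to it."""
    index = defaultdict(set)
    for word in words:
        index[strip_accents(word).lower()].add(word)
    return index

def candidates(text, index):
    """Return the possible restorations for each token; unknown tokens stay as-is."""
    return [sorted(index.get(token.lower(), {token})) for token in text.split()]

index = build_index(WORDS)
for options in candidates("tantot il rit a gorge deployee", index):
    print(options)
```

With this word list only `a` has more than one candidate (`a` and `à`), so it is the only token where context from an n-gram model would be needed. Note that NFD folding handles combining accents only; ligatures such as `œ` would need separate handling.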
© Stack Overflow or respective owner