Using Markov models to convert all caps to mixed case and related problems

Posted by hippietrail on Stack Overflow
Published on 2010-12-21T02:15:16Z Indexed on 2010/12/21 14:54 UTC

I've been thinking about using Markov techniques to restore missing information to natural language text.

  • Restore mixed case to text in all caps
  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII
  • Convert rough phonetic transcriptions back into native alphabets

These seem to be ordered from least to most difficult. In each case the underlying problem is the same: resolving ambiguities based on context.

I could use Wiktionary as a dictionary to list the candidate forms of each word, and Wikipedia as a corpus for n-gram counts, then use Markov chains to pick the most likely candidate in context.
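
As a rough illustration of the idea for the all-caps case, here is a minimal sketch rather than a tested solution. It assumes a correctly cased, tokenized corpus, a first-order (bigram) model with add-k smoothing, and a Viterbi search over the cased variants of each word. The names `train`, `truecase`, `START`, and the toy corpus are all invented for this example; in practice the counts would come from something like Wikipedia text.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train(sentences):
    """Count cased unigrams and bigrams, and record the cased variants of each word."""
    unigrams, bigrams = Counter(), Counter()
    variants = defaultdict(set)  # uppercase form -> cased forms seen in the corpus
    for sent in sentences:
        prev = START
        unigrams[START] += 1
        for tok in sent:
            unigrams[tok] += 1
            bigrams[(prev, tok)] += 1
            variants[tok.upper()].add(tok)
            prev = tok
    return unigrams, bigrams, variants

def truecase(tokens, model, k=0.1):
    """Viterbi search for the most likely cased sequence under the bigram model."""
    unigrams, bigrams, variants = model
    vocab = len(unigrams)

    def logp(prev, tok):
        # Add-k smoothed estimate of P(tok | prev)
        return math.log((bigrams[(prev, tok)] + k) / (unigrams[prev] + k * vocab))

    best = {START: (0.0, [])}  # last candidate -> (log score, best path so far)
    for tok in tokens:
        # Unknown words fall back to lowercase (an arbitrary choice for this sketch)
        cands = variants.get(tok.upper(), {tok.lower()})
        nxt = {}
        for c in cands:
            score, path = max(
                ((s + logp(p, c), pth) for p, (s, pth) in best.items()),
                key=lambda x: x[0],
            )
            nxt[c] = (score, path + [c])
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]

# Toy corpus invented for illustration only
corpus = [
    ["George", "lost", "his", "keys"],
    ["The", "SIM", "card", "is", "in", "the", "phone"],
    ["He", "hid", "in", "the", "bush"],
    ["Bush", "met", "George"],
]
model = train(corpus)
print(" ".join(truecase("GEORGE LOST HIS SIM CARD IN THE BUSH".split(), model)))
# prints: George lost his SIM card in the bush
```

With this toy data, "BUSH" after "the" resolves to "bush" while a sentence-initial "BUSH" followed by "MET" resolves to "Bush", which is the kind of context-dependent choice a plain dictionary lookup cannot make.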

Am I on the right track? Are there existing services, libraries, or tools for this sort of thing?

Examples

  • GEORGE LOST HIS SIM CARD IN THE BUSH -> George lost his SIM card in the bush
  • tantot il rit a gorge deployee -> tantôt il rit à gorge déployée
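
The second example can be approached by stripping accents from a correctly accented corpus to learn which accented forms each plain-ASCII spelling can stand for. The sketch below is only a frequency baseline with no context model, and the helper names (`strip_accents`, `build_candidates`, `restore`) and the toy corpus are assumptions made for this example. It relies on `unicodedata.normalize` and `unicodedata.combining` from the Python standard library.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(word):
    """Decompose to NFD and drop combining marks, e.g. 'déployée' -> 'deployee'.

    Letters with no canonical decomposition (such as 'ø' or 'ß') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_candidates(corpus_tokens):
    """Map each accent-stripped key to a Counter of the original forms seen."""
    table = defaultdict(Counter)
    for tok in corpus_tokens:
        table[strip_accents(tok)][tok] += 1
    return table

def restore(tokens, table):
    """Pick the most frequent original form per token; unknown tokens pass through."""
    return [table[t].most_common(1)[0][0] if t in table else t for t in tokens]

# Toy corpus invented for illustration only
corpus = "tantôt il rit à gorge déployée il a ri à la fin".split()
table = build_candidates(corpus)
print(" ".join(restore("tantot il rit a gorge deployee".split(), table)))
# prints: tantôt il rit à gorge déployée
```

This baseline always maps "a" to whichever of "a" and "à" is more frequent in the corpus, so it gets the other one wrong every time. That pair is exactly the sort of ambiguity where n-gram context is needed to choose between candidates.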

© Stack Overflow or respective owner
