Using Markov models to convert all caps to mixed case and related problems

Posted by hippietrail on Stack Overflow
Published on 2010-12-21T02:15:16Z

Filed under: unicode | nlp | ambiguity | markov-models | ngram

I've been thinking about using Markov techniques to restore information that is missing from natural language text:

Restore mixed case to text in all caps
Restore accents / diacritics to text in languages that should have them but that has been converted to plain ASCII
Convert rough phonetic transcriptions back into native alphabets

These seem to be in order from least difficult to most difficult. In each case the problem is resolving ambiguities based on context.

I could use Wiktionary as a dictionary and Wikipedia as a corpus, building n-gram statistics and Markov chains from the corpus to resolve the ambiguities.
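As a rough sketch of the idea, the Python below restores case with a bigram (first-order Markov) model and a Viterbi search over the case variants of each word. Everything in it is an assumption made for illustration: the four-sentence CORPUS stands in for Wikipedia, the function names train and restore_case are made up, and the 0.7 / 0.3 interpolation weights are arbitrary rather than tuned.

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for Wikipedia text (an assumption for illustration).
CORPUS = [
    "George lost his SIM card in the bush",
    "the SIM card was lost",
    "George Bush lost his card",
    "he put the card in his bag",
]


def train(sentences):
    """Collect case variants per lowercased word, plus unigram/bigram counts."""
    variants = defaultdict(set)   # "bush" -> {"bush", "Bush"}
    unigrams = Counter()          # how often each cased form occurs
    contexts = Counter()          # how often each form occurs as a left context
    bigrams = Counter()           # how often (previous, current) occurs
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            variants[cur.lower()].add(cur)
            unigrams[cur] += 1
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return variants, unigrams, contexts, bigrams


def restore_case(text, variants, unigrams, contexts, bigrams):
    """Pick the most probable cased form of each word with a Viterbi search."""
    total = sum(unigrams.values())
    vocab = len(unigrams)

    def logp(prev, cur):
        # Bigram estimate interpolated with an add-one smoothed unigram
        # estimate, so unseen word pairs never produce log(0).
        p_bi = bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0
        p_uni = (unigrams[cur] + 1) / (total + vocab)
        return math.log(0.7 * p_bi + 0.3 * p_uni)

    # best maps each candidate for the current position to the best
    # (log probability, path) of any sequence ending in that candidate.
    best = {"<s>": (0.0, [])}
    for word in text.lower().split():
        candidates = sorted(variants.get(word, {word}))
        best = {
            cand: max(
                ((score + logp(prev, cand), path + [cand])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
            for cand in candidates
        }
    return " ".join(max(best.values(), key=lambda item: item[0])[1])


MODEL = train(CORPUS)
print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", *MODEL))
print(restore_case("GEORGE BUSH LOST HIS CARD", *MODEL))
```

With this toy corpus the only ambiguous word is "bush", and the preceding word decides it: "the" favours "bush" while "George" favours "Bush". A real corpus would also need sentence-initial capitals handled separately, since they say little about a word's usual case.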

Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?

Examples

GEORGE LOST HIS SIM CARD IN THE BUSH -> George lost his SIM card in the bush
tantot il rit a gorge deployee -> tantôt il rit à gorge déployée
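For the diacritics example, the ambiguity can be made explicit by indexing dictionary words under their accent-stripped form. The sketch below does that with the standard unicodedata module and then simply picks the most frequent candidate. WORD_COUNTS holds invented counts standing in for real corpus frequencies, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Decompose to NFD and drop combining marks: 'déployée' -> 'deployee'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Invented counts standing in for frequencies from a real corpus (assumption).
WORD_COUNTS = Counter({
    "tantôt": 40, "il": 900, "rit": 30, "à": 800, "a": 600,
    "gorge": 25, "déployée": 5, "déployé": 8,
})


def build_candidates(word_counts):
    """Map each accent-stripped form to the dictionary words that produce it."""
    index = defaultdict(list)
    for word in word_counts:
        index[strip_diacritics(word)].append(word)
    return index


def restore_diacritics(text, word_counts):
    """Replace each token with its most frequent accented candidate."""
    index = build_candidates(word_counts)
    restored = []
    for token in text.split():
        options = index.get(token, [token])  # unknown tokens pass through
        restored.append(max(options, key=lambda w: word_counts[w]))
    return " ".join(restored)


print(restore_diacritics("tantot il rit a gorge deployee", WORD_COUNTS))
```

Choosing by frequency alone turns every "a" into "à", which is wrong in a sentence such as "il a ri". That is the kind of case where the candidates would have to be scored in context, for instance with n-gram probabilities, rather than one word at a time.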

© Stack Overflow or respective owner
