Splitting string on probable English word boundaries

Posted by Sean on Stack Overflow, 2010-02-13.

I recently used Adobe Acrobat Pro's OCR feature to process a Japanese kanji dictionary. The overall quality of the output is quite a bit better than I'd hoped, but word boundaries in the English portions of the text have often been lost. For example, here's one line from my file:

softening;weakening(ofthemarket)8 CHANGE [transform] oneselfINTO,takethe form of; disguise oneself

I could go around and insert the missing word boundaries everywhere, but this would be adding to what is already a substantial task. I'm hoping that there might exist software which can analyze text like this, where some of the words run together, and split the text on probable word boundaries. Is there such a package?
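In case it helps clarify what I'm after, here's a rough Python sketch of the kind of dictionary-driven splitting I have in mind: score every candidate split with word frequencies and keep the lowest-cost one. The tiny FREQ table and the unknown-word penalty below are purely illustrative; a real run would need unigram counts loaded from an actual corpus.

    import math
    from functools import lru_cache

    # Hypothetical unigram counts -- a real run would load these from a corpus.
    FREQ = {
        "softening": 5, "weakening": 5, "of": 1000, "the": 2000, "market": 50,
        "change": 80, "transform": 10, "oneself": 8, "into": 300,
        "take": 120, "form": 60, "disguise": 4,
    }
    TOTAL = sum(FREQ.values())

    def word_cost(word: str) -> float:
        """Negative log probability; unknown strings pay a length penalty."""
        count = FREQ.get(word.lower())
        if count is None:
            return 10.0 * len(word)  # crude penalty so garbage splits lose
        return -math.log(count / TOTAL)

    def segment(text: str, max_word_len: int = 20) -> list[str]:
        """Split text into the lowest-cost sequence of candidate words."""

        @lru_cache(maxsize=None)
        def best(i: int) -> tuple[float, tuple[str, ...]]:
            # Best (cost, words) segmentation of text[i:], found by trying
            # every candidate first word and recursing on the remainder.
            if i == len(text):
                return 0.0, ()
            candidates = []
            for j in range(i + 1, min(i + max_word_len, len(text)) + 1):
                cost, rest = best(j)
                word = text[i:j]
                candidates.append((word_cost(word) + cost, (word,) + rest))
            return min(candidates)

        return list(best(0)[1])

    if __name__ == "__main__":
        print(" ".join(segment("takethe")))      # -> take the
        print(" ".join(segment("ofthemarket")))  # -> of the market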

I'm using Emacs, so it'd be extra-sweet if the package in question were already an Emacs package, or could be readily integrated into it, so that I could put my cursor on a line like the one above and repeatedly invoke a command that splits it on word boundaries in decreasing order of probable correctness.
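Failing a ready-made package, I suppose I could pipe the region through an external script with shell-command-on-region (C-u M-| replaces the region with the command's output). Here's a rough filter along those lines, assuming the segment() function from the sketch above were saved in a module called segmenter (a hypothetical name); with a realistic frequency list, runs that are already single words should mostly come back unchanged.

    #!/usr/bin/env python3
    # Stdin/stdout filter suitable for Emacs's shell-command-on-region
    # (C-u M-| path/to/this/script replaces the region with the output).

    import re
    import sys

    from segmenter import segment  # hypothetical module holding the sketch above

    def split_runs(line: str) -> str:
        # Re-segment every alphabetic run; punctuation and any kanji on the
        # line pass through unchanged.
        return re.sub(r"[A-Za-z]+",
                      lambda m: " ".join(segment(m.group(0))),
                      line)

    for line in sys.stdin:
        sys.stdout.write(split_runs(line.rstrip("\n")) + "\n")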
