What are common approaches for translating certain words (or expressions) inside a given text, when the text must be reconstructed afterwards (with punctuation and everything)?
The translation comes from a lookup table, and covers words, collocations, and special tokens such as L33t, CUL8R, or the emoticon :-).
Simple string search-and-replace is not enough, since it also replaces parts of longer words (e.g. replacing cat with dog turns caterpillar into dogerpillar).
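To make the substring problem concrete, here is a minimal illustration with plain `str.replace`. The sample sentence is made up for this example and is not part of my real data.

```python
# Naive replacement: every occurrence of the substring "cat" is rewritten,
# including the one inside "caterpillar".
text = "the cat chased a caterpillar"
naive = text.replace("cat", "dog")
print(naive)  # the dog chased a dogerpillar
```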
Assume the following input:
s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"
After translation, I should receive something like:
result = "anna, started a george dilbertion process cat-bert smiley"
I can't simply tokenize, since I lose punctuation and word positions.
Regular expressions work for normal words, but they don't catch special expressions like the smiley :-):
re.sub(r'\bword\b', 'translation', 'word')  ==> 'translation'
re.sub(r'\b:-\)\b', 'smiley', ':-)')        ==> ':-)' (unchanged)
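Here is a small script that reproduces this on the sample input. As far as I can tell, the cause is that `\b` only matches between a word character and a non-word character, and the smiley consists entirely of non-word characters. The variable names `words`, `smiley` and `glued` are only for this illustration.

```python
import re

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"

# Whole-word match works: "dilbert" is replaced, "dilbertion" is left alone.
words = re.sub(r'\bdilbert\b', 'george', s)
print(words)

# The smiley is not matched: the result is identical to the input.
smiley = re.sub(r'\b:-\)\b', 'smiley', s)
print(smiley == s)  # True

# The same pattern does match when the smiley sits between word characters,
# because then there is a word boundary on each side of it.
glued = re.sub(r'\b:-\)\b', 'smiley', 'a:-)b')
print(glued)  # asmileyb
```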
For now I'm using the above-mentioned regex for normal words and a simple replace for the non-alphanumeric ones, but it's far from bulletproof.
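A rough sketch of that workaround, to show where it breaks. The table contents are guessed from the sample input and expected output above, and `table` and `translate` are hypothetical names for this example; my real code differs in the details.

```python
import re

# Hypothetical lookup table, inferred from the sample input and output.
table = {
    "dogbert": "anna",
    "dilbert": "george",
    "proces": "process",
    ":-)": "smiley",
}

def translate(text, table):
    """Apply each table entry in turn: regex for words, plain replace otherwise."""
    for src, dst in table.items():
        if src.isalnum():
            # Purely alphanumeric entries: match whole words only.
            text = re.sub(r'\b' + re.escape(src) + r'\b', dst, text)
        else:
            # Everything else (emoticons, etc.): plain substring replace.
            text = text.replace(src, dst)
    return text

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"
result = translate(s, table)
print(result)  # anna, started a george dilbertion process cat-bert smiley

# One way it goes wrong: a different emoticon that merely contains ":-)"
# is partly rewritten by the plain replace.
broken = translate("evil grin >:-)", table)
print(broken)  # evil grin >smiley
```

It produces the expected result for the sample, but the plain replace has the same substring problem as before, and applying the entries one after another means an earlier translation could be translated again by a later entry.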
(P.S. I'm using Python.)