A better way of converting Codepage-1251 in RTF to Unicode

Posted by blue painted on Stack Overflow
Published on 2010-03-15T16:05:13Z

I am trying to parse RTF (via MSEDIT) in various languages, all in Delphi 2010, in order to produce HTML in Unicode.

Taking Russian/Cyrillic as my starting point, I find that the overall document code page is 1252 (Western), but the Russian parts of the text are identified by the charset of the font (RUSSIAN_CHARSET, 204).

So far I am:

1) Using AnsiString (or RawByteString) when parsing the RTF

2) Determining the code page by a lookup from the font charset (see http://msdn.microsoft.com/en-us/library/cc194829.aspx)

3) Translating using a lookup table in my code (this table was generated from http://msdn.microsoft.com/en-gb/goglobal/cc305144.aspx) - I'm going to need one table per supported code page!
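
To make the three steps concrete, here is a rough sketch of the pipeline in Python rather than Delphi, purely as an illustration. It keeps the text as raw bytes, maps the font charset to a code page, and then lets a built-in codec do the translation instead of a hand-written table. The names CHARSET_TO_CODEPAGE, unescape_rtf_bytes and decode_run are made up for this example, the mapping covers only a few single-byte charsets, and it assumes the non-ASCII characters arrive as RTF \'hh hex escapes.

```python
import re

# Assumed subset of font charset -> Windows code page pairs (illustrative only).
CHARSET_TO_CODEPAGE = {
    0: 1252,    # ANSI_CHARSET
    161: 1253,  # GREEK_CHARSET
    162: 1254,  # TURKISH_CHARSET
    177: 1255,  # HEBREW_CHARSET
    178: 1256,  # ARABIC_CHARSET
    186: 1257,  # BALTIC_CHARSET
    204: 1251,  # RUSSIAN_CHARSET
    238: 1250,  # EASTEUROPE_CHARSET
}

# Matches an RTF hex escape such as \'cf and captures the two hex digits.
_HEX_ESCAPE = re.compile(rb"\\'([0-9a-fA-F]{2})")


def unescape_rtf_bytes(fragment: bytes) -> bytes:
    """Step 1: stay in bytes, turning each \\'hh escape into its single byte."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), fragment)


def decode_run(raw: bytes, charset: int) -> str:
    """Steps 2 and 3: look up the code page, then decode with a built-in codec."""
    codepage = CHARSET_TO_CODEPAGE[charset]
    return raw.decode(f"cp{codepage}")


if __name__ == "__main__":
    run = unescape_rtf_bytes(rb"\'cf\'f0\'e8\'e2\'e5\'f2")
    print(decode_run(run, 204))
```

The point of the sketch is only that step 3 needs no per-code-page table once a codec keyed by code page number is available. On Windows, the MultiByteToWideChar API takes a code page number in the same way.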

There MUST be a better way than this? Preferably something supplied by the OS and so less brittle than tables of constants.

© Stack Overflow or respective owner
