How to convert non-Latin-based encoded text into UTF-8, or make both coexist on the same page?

Posted by Yallaa on Stack Overflow
Published on 2010-04-19T17:20:57Z Indexed on 2010/04/19 17:23 UTC

Good day,

I have a script that scrapes the title and description of remote pages and prints those values into a page served as charset=UTF-8. The problem: whenever a remote page uses a non-Latin encoding (Arabic, Russian, Chinese, Japanese, etc.), the imported values print as garbled text.

I've tried passing those values through either iconv or mb_convert_encoding converters but without much success.

Then I tried detecting the remote encoding first and switching my presentation page from UTF-8 to the remote encoding. That works for the imported values, but the existing UTF-8 content in that language on the page gets garbled instead.
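For reference, the detection step could look roughly like the sketch below. The function name detect_declared_charset and its two parameters are hypothetical (not from my actual script), and it only reads what the remote page declares in its Content-Type header or meta tag, which may not match the real bytes.

```php
<?php
// Hypothetical helper: returns the charset a remote page declares, or null if none is found.
// $contentTypeHeader is the raw Content-Type header value; $html is the fetched page body.
function detect_declared_charset(string $contentTypeHeader, string $html): ?string
{
    // 1. HTTP header, e.g. "text/html; charset=windows-1251"
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentTypeHeader, $m)) {
        return strtolower($m[1]);
    }
    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    // Nothing declared: the caller has to guess or fall back to a default.
    return null;
}
```

The header is checked before the meta tag here; that ordering is an assumption of this sketch, not something I have verified against every remote site.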

Example:
Say I import those values from a Russian windows-1251 page into my UTF-8 encoded page, which has mixed English/Russian content. I convert the imported non-UTF-8 string to UTF-8 using either iconv or mb_convert_encoding.

I tried:
$RemoteValue = iconv($RemoteEncoding, 'UTF-8', $RemoteValue);
or
$RemoteValue = mb_convert_encoding($RemoteValue, "UTF-8", $RemoteEncoding);
or
$RemoteValue = mb_convert_encoding($RemoteValue, "UTF-8", "auto");
without success.
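To make the conversion step concrete, here is a minimal sketch of what I would expect to work, assuming the mbstring extension is loaded and the source encoding is known. The helper name to_utf8 is hypothetical, and the sample bytes are the word "Привет" written out by hand in windows-1251.

```php
<?php
// Hypothetical helper: convert $text from $fromEncoding to UTF-8.
// Text that is declared as UTF-8 and already valid UTF-8 is returned untouched,
// so it does not get converted a second time.
function to_utf8(string $text, string $fromEncoding): string
{
    if (strcasecmp($fromEncoding, 'UTF-8') === 0 && mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fromEncoding);
}

// "Привет" as raw windows-1251 bytes (one byte per letter).
$remoteValue = "\xCF\xF0\xE8\xE2\xE5\xF2";

$utf8 = to_utf8($remoteValue, 'Windows-1251');

// Sanity checks: the result should be valid UTF-8 and twice as long in bytes,
// because each Cyrillic letter takes two bytes in UTF-8.
var_dump(mb_check_encoding($utf8, 'UTF-8'));
var_dump(strlen($remoteValue), strlen($utf8));
```

Checking the result with mb_check_encoding and comparing byte lengths is only a way to see whether the conversion itself ran; it says nothing about whether the page that prints the value is really being served as UTF-8.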

If I detect that the remote page is windows-1251 encoded and change my presentation page (which already has UTF-8 encoded mixed-language content) to match the remote page, then the Japanese UTF-8 content on the existing page gets garbled...

  • Can two differently encoded strings coexist on the same page (e.g. UTF-8 and windows-1251)?
  • Am I using the converters correctly? Any hints as to why they don't work?
  • Is there a better way to do this?

Thank you for your help

© Stack Overflow or respective owner
