How to convert non-Latin encoded text into UTF-8, or make the two coexist on the same page?
- by Yallaa
Good day,
I have a script that scrapes the title/description of remote pages and prints those values into a corresponding charset=UTF-8 encoded page. Here is the problem: whenever a remote page uses a non-Latin character encoding (Arabic, Russian, Chinese, Japanese, etc.), the imported values print as garbled text.
I've tried passing those values through either iconv or mb_convert_encoding converters but without much success.
Then I tried detecting the remote encoding first and changing my presentation page's encoding to the remote one instead of the current UTF-8. That works okay for the imported values, but the existing UTF-8 content in that language on the page gets garbled instead.
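For context, the detection step can be sketched roughly as follows. This is a minimal illustration, not my actual script: the helper name detectRemoteCharset() and its arguments (raw response header lines plus the fetched HTML) are hypothetical, and it only covers the two most common places a charset is declared.

```php
<?php
// Hypothetical helper (assumed for illustration): returns the declared
// charset in lowercase, or null when none is declared.
function detectRemoteCharset(array $headers, string $html): ?string
{
    // 1. Prefer the charset from a Content-Type response header.
    foreach ($headers as $header) {
        if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/i', $header, $m)) {
            return strtolower($m[1]);
        }
    }
    // 2. Fall back to <meta charset="..."> or the http-equiv form,
    //    both of which contain "charset=" inside a <meta> tag.
    if (preg_match('/<meta[^>]+charset=["\']?([\w.:-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}
```

A declared charset can still be wrong or missing, so the result is best treated as a hint rather than a guarantee.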
Example:
If I try to import those values from a Russian windows-1251 page into my UTF-8 encoded page, which has a mix of English/Russian content, I convert the imported non-UTF-8 string to UTF-8 using either iconv or mb_convert_encoding.
I tried:
$RemoteValue = iconv($RemoteEncoding, 'UTF-8', $RemoteValue);
or
$RemoteValue = mb_convert_encoding($RemoteValue, "UTF-8", $RemoteEncoding);
or
$RemoteValue = mb_convert_encoding($RemoteValue, "UTF-8", "auto");
without success.
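One possible cause worth ruling out is converting a string that is already UTF-8, or converting from the wrong source encoding. The sketch below guards against the first case. The function name toUtf8() and the Windows-1252 fallback are assumptions made for illustration, and it relies on the mbstring extension being available.

```php
<?php
// Hypothetical helper (assumed for illustration): convert a scraped
// string to UTF-8 exactly once.
function toUtf8(string $value, ?string $declaredEncoding, string $fallback = 'Windows-1252'): string
{
    // Bytes that are already valid UTF-8 are returned untouched;
    // converting them again would double-encode and garble the text.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise convert from the declared encoding, or from the
    // assumed fallback when the remote page declared nothing.
    $from = $declaredEncoding ?? $fallback;
    return mb_convert_encoding($value, 'UTF-8', $from);
}

// "Привет" as windows-1251 bytes.
$RemoteValue = toUtf8("\xCF\xF0\xE8\xE2\xE5\xF2", 'Windows-1251');
```

Note that the validity check is a heuristic: a short non-UTF-8 string can occasionally happen to be valid UTF-8, and an unsupported encoding name will make mb_convert_encoding fail, so the declared name may need validating first.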
If I detect that the remote page is windows-1251 encoded and change my presentation page (which already has UTF-8 encoded mixed-language content) to match the remote page, then the Japanese UTF-8 content on the existing page gets garbled...
Can two differently encoded strings coexist on the same page (e.g. UTF-8 and windows-1251)?
Am I using the converters correctly? Any hints as to why they don't work?
Is there any better way to do this?
Thank you for your help.