Detect remote charset in PHP
- by yallaa
Hello,
I would like to determine a remote page's encoding by detecting the Content-Type meta tag
<meta http-equiv="Content-Type" content="text/html; charset=XXXXX" />
if present.
I retrieve the remote page and run a regex over it to find the charset setting, if it is present.
I am still learning hence the problem below...
Here is what I have:
$EncStart = 'charset=';
$EncEnd = '" \/\>';
preg_match( "/$EncStart(.*)$EncEnd/s", $RemoteContent, $RemoteEncoding );
echo $RemoteEncoding[ 1 ];
The above does echo the name of the encoding, but the match does not know where to stop: it prints the rest of the line and then most of the rest of the remote page in my test.
Example: when testing a remote Russian page, it printed:
windows-1251" /
rest of page ....
Which means that $EncStart was okay, but the $EncEnd part of the regex failed to stop the matching. After the name of the encoding, this meta tag usually ends in one of three ways:
"> | "/> | " />
I do not know whether these endings can be used to mark the end of the match and, if so, how to escape them. I tried several ways of doing it, but none worked.
Thank you in advance for lending a hand.
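For reference, one possible way around the ending problem, offered as a minimal sketch and not a definitive solution: the capture group (.*) is greedy, and with the s modifier it also crosses newlines, so it runs to the last place in the page where the ending pattern matches. Capturing only characters that cannot be part of a charset name avoids having to match the ending at all, so the three endings listed above stop mattering. The function name detectMetaCharset and the sample strings are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the charset
// named in the markup, or null when no charset declaration is found.
function detectMetaCharset(string $html): ?string
{
    // The character class excludes quotes, whitespace, "/", ">" and ";",
    // so the capture stops at the end of the charset name by itself,
    // whichever of the three tag endings follows it.
    if (preg_match('/charset\s*=\s*["\']?([^"\'\s\/>;]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Made-up sample tags covering the three endings: ">  "/>  " />
$samples = [
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">',
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251"/>',
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />',
];

foreach ($samples as $sample) {
    echo detectMetaCharset($sample), "\n";
}
```

With the original approach, making the group lazy, (.*?), and ending the pattern at the closing quote would also stop the match early; the character class is simply harder to get wrong when the tag ending varies.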