Trouble with UTF-8 chars & Apache2 rewrite rules

Posted by tixrus on Stack Overflow
Published on 2010-05-26T19:40:07Z Indexed on 2010/05/26 21:01 UTC

I have seen the post http://stackoverflow.com/questions/2565864/validating-utf-8-in-htaccess-rewrite-rule and I think it is great, but I am stuck on a more fundamental problem first:

I needed to extend my application to handle UTF-8 characters in query string parameters, in directory and file names, in text displayed to users, and so on.

I configured my Apache with DefaultCharset utf-8, and my PHP as well, if that matters. My original rewrite rule accepted only A-Za-z, underscore, and hyphen, and it worked: anything else gave a 404, which is what I want. Now, however, everything seems to match, including input I don't want. And although it seems to match, the value doesn't reach the query string unless it is a plain A-Za-z_- string.

I find this confusing, because the rule says to put whatever matched into the query string.

Here is the original rule: RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]

and here is the revised rule: RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

I made the change because I read somewhere that \w matches all alphabetic characters, whereas A-Z etc. matches only the ones without accents and the like.
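
To make the \w claim concrete, here is a minimal PHP sketch rather than anything Apache-specific. It assumes the rewrite engine compares patterns byte by byte, the way PCRE does without its UTF-8 mode; that is an assumption, not something confirmed above. The variable names and the [\p{L}_-] pattern are only illustrative.

```php
<?php
// "México" written as the raw UTF-8 bytes a percent-decoded URL would carry.
$name = "M\xC3\xA9xico";

// Byte-oriented match (no /u modifier): in the default C locale, \w covers
// only ASCII letters, digits, and underscore, so the two bytes of "é" fail.
$byteMode = preg_match('/^\w+$/', $name);

// UTF-8 mode (/u) with an explicit Unicode letter class, plus _ and -.
$utf8Mode = preg_match('/^[\p{L}_-]+$/u', $name);

// Slash and dot are not in the class, so this input is still rejected.
$rejected = preg_match('/^[\p{L}_-]+$/u', 'fuzzle/buzzle/j.qle');

var_dump($byteMode, $utf8Mode, $rejected);
```

If that assumption holds, switching the rule from [A-Za-z_-] to \w would not by itself admit accented letters, which is consistent with both rules behaving the same for México.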

It doesn't seem to matter which of those rules I use. Here is what happens:

In the application I have this:

echo $_GET['g'];

If I feed it a URL like http://mydomain.com/puzzle/USA it echoes "USA" and works fine.
If I feed it a URL like http://mydomain.com/puzzle/México it echoes nothing, warns me that index g is not defined, and of course doesn't load the resources for México.
If I feed it a URL like http://mydomain.com/puzzle/fuzzle/buzzle/j.qle it does the same thing.
This last case should be a 404!

And it does this no matter which of the above rules I use. I configured a rewrite log:

   RewriteLogLevel 5
   RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite

but it is empty.

Here is an excerpt from the regular access log (both requests get a status of 200):

[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/l.foo HTTP/1.1" 200 342
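
On the decoding side, a small PHP sketch of what the logged segment M%C3%A9xico turns into once percent-decoded. PHP applies this decoding to $_GET values on its own, so the explicit rawurldecode() call is only here to make the step visible; the variable names are illustrative.

```php
<?php
// Path segment exactly as it appears in the access log above.
$logged = 'M%C3%A9xico';

// rawurldecode() turns each %XX escape back into a single byte.
$decoded = rawurldecode($logged);

// An empty pattern with /u matches only if the subject is valid UTF-8.
$isUtf8 = preg_match('//u', $decoded) === 1;

// Count characters by matching one UTF-8 character at a time.
$chars = preg_match_all('/./us', $decoded);

// strlen() counts bytes, so "é" contributes two of them.
printf("bytes=%d chars=%d utf8=%s\n", strlen($decoded), $chars, $isUtf8 ? 'yes' : 'no');
```

So the decoded value is ordinary UTF-8; whether it displays correctly then depends on the page being served as UTF-8 too.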

What can I do to get these $%#$@(*#@!!! characters, but not slash, dot, or other non-alpha characters, into my program, and once they are there, will it decode them correctly? Would POSIX character classes work any better? Is there anything else I need to configure?

© Stack Overflow or respective owner
