Parsing every part of an HTTP header field-value
- by brickner
Hi all.
I'm parsing HTTP data directly from captured packets. You can assume the TCP stream has already been reassembled.
I'm looking for the best way to parse HTTP as accurately as possible.
The main issue here is the HTTP header.
Looking at the base HTTP/1.1 RFC (RFC 2616), it seems that parsing the HTTP header will be complex.
The RFC defines a detailed grammar (in augmented BNF) for the different parts of the header.
Should I translate these grammar rules into regular expressions to parse the different parts of the HTTP header?
So far I've only written parsing for the generic form of an HTTP header:
message-header = field-name ":" [ field-value ]
I also replace inner LWS with a single SP, and I combine repeated headers that share a field-name into one comma-separated field-value, as described in section 4.2.
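For illustration, the generic step described above can be sketched roughly as follows in Python. This is a minimal sketch, not my actual code: the name `parse_generic_headers` is hypothetical, it assumes the header block is already decoded to a string with CRLF line endings and the start line removed, and it does not treat quoted-strings specially when collapsing whitespace.

```python
import re

def parse_generic_headers(raw: str) -> dict[str, str]:
    """Parse a header block into {lowercased field-name: field-value}.

    Hypothetical helper for illustration; assumes CRLF line endings and
    that the request/status line has already been stripped.
    """
    # Unfold continuation lines: a CRLF followed by SP or HT continues
    # the previous header line.
    unfolded = re.sub(r"\r\n(?=[ \t])", "", raw)
    headers: dict[str, str] = {}
    for line in unfolded.split("\r\n"):
        if not line:
            break  # an empty line ends the header block
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a message-header; a real parser might report it
        name = name.strip().lower()  # field names are case-insensitive
        # Replace each run of linear whitespace with a single SP.
        value = re.sub(r"[ \t]+", " ", value).strip()
        if name in headers:
            # Repeated field-names are combined into one comma-separated value.
            headers[name] += ", " + value
        else:
            headers[name] = value
    return headers
```

Usage: `parse_generic_headers("Host: example.com\r\nAccept: a\r\nAccept: b\r\n\r\n")` yields one `host` entry and one combined `accept` entry.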
However, section 14.9 (Cache-Control), for example, shows that parsing the individual parts of a field-value requires a much more complex scheme.
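To show the kind of per-header parsing I mean, here is a rough Python sketch for a Cache-Control field-value. It is only an assumption about one possible approach: the name `parse_cache_control` is hypothetical, it only handles the `token [ "=" ( token | quoted-string ) ]` shape, it rejects empty list elements, and it does not validate directive-specific arguments such as delta-seconds.

```python
import re

# token: any CHAR except CTLs and separators (RFC 2616 section 2.2).
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# quoted-string: double-quoted text, with backslash quoted-pairs allowed.
_QUOTED = r'"(?:[^"\\]|\\.)*"'
# One directive, followed by a comma or the end of the value.
_DIRECTIVE = re.compile(
    rf"\s*({_TOKEN})(?:\s*=\s*({_TOKEN}|{_QUOTED}))?\s*(?:,|$)"
)

def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control field-value into {directive: argument or None}.

    Hypothetical helper for illustration only.
    """
    directives = {}
    pos = 0
    while pos < len(value):
        m = _DIRECTIVE.match(value, pos)
        if m is None:
            raise ValueError(f"malformed directive at offset {pos}")
        name, arg = m.group(1).lower(), m.group(2)
        if arg is not None and arg.startswith('"'):
            # Strip the surrounding quotes and undo quoted-pair escapes.
            arg = re.sub(r"\\(.)", r"\1", arg[1:-1])
        directives[name] = arg
        pos = m.end()
    return directives
```

Every header with its own grammar would need something similar, which is why I'm asking about the overall design.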
How do you suggest I handle the complex parts of HTTP parsing, specifically the field-value, given that I want to parse every part of the message and give users of the parser the full capabilities of HTTP?
Design suggestions for this would also be appreciated.
Thanks.