Performance of tokenizing CSS in PHP

Posted by Boldewyn on Stack Overflow
Published on 2010-04-09T18:53:26Z

This is a noob question from someone who has never written a parser or lexer before.

I'm writing a tokenizer/parser for CSS in PHP (please don't reply with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).

It's a list of 21 possible tokens, all but two of which cannot be represented as static strings.

My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) for each, and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string it takes between 2 and 8 seconds, which is too slow for my project.

Now I'm racking my brain over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh!s that I fell into?

I made some optimizations, like checking for '@', '#' or '"' as the first character of the remaining string and then applying only the relevant regexp, but this didn't bring any great performance boost.
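To make the first-character check concrete, here is a minimal sketch of one way it could look. The patterns are simplified stand-ins rather than the real CSS token regexps, and the names $BY_FIRST_CHAR and candidate_patterns() are hypothetical, introduced only for this illustration.

```php
<?php
// Simplified placeholder patterns (assumption); the real CSS token
// regexps from the W3C grammar are considerably longer.
$TOKENS = array(
    'ATKEYWORD' => '@[a-zA-Z_-][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'STRING'    => '"[^"]*"',
    'IDENT'     => '[a-zA-Z_-][a-zA-Z0-9_-]*',
);

// Hypothetical lookup table: a leading character that can only start
// one kind of token is mapped to that token type.
$BY_FIRST_CHAR = array('@' => 'ATKEYWORD', '#' => 'HASH', '"' => 'STRING');

// Return only the patterns worth trying for the remaining string.
function candidate_patterns($string, array $tokens, array $byFirstChar)
{
    $first = substr($string, 0, 1);
    if (isset($byFirstChar[$first])) {
        $type = $byFirstChar[$first];
        return array($type => $tokens[$type]);
    }
    // No shortcut applies, so fall back to trying every pattern.
    return $tokens;
}
```

The inner foreach of the tokenizer would then iterate over candidate_patterns($string, $TOKENS, $BY_FIRST_CHAR) instead of over $TOKENS directly. This only shortens the list for tokens that start with one of the three characters, which may be why the gain stays small.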

My code (snippet) so far:

$TOKENS = array(
  'IDENT' => '...regexp...',
  'ATKEYWORD' => '@...regexp...',
  'STRING' => '"...regexp..."|\'...regexp...\'',
  //...
);

$string = '...CSS source string...';
$stream = array();

// we reduce $string token by token
while (true) {
    // unconsumed whitespace at the start is insignificant,
    // but doing a trim reduces exec time by 25%
    $string = ltrim($string, " \t\r\n\f");
    if ($string === '') {
        break; // nothing left to tokenize
    }
    $matches = array();
    // loop through all possible tokens
    foreach ($TOKENS as $t => $p) {
        // The '&' is used as delimiter, because it isn't used anywhere in
        // the token regexps. The (?:...) group makes '^' apply to every
        // alternative of a pattern, not only to the first one.
        if (preg_match('&^(?:'.$p.')&Su', $string, $matches)) {
            $stream[] = array($t, $matches[0]);
            $string = substr($string, strlen($matches[0]));
            // Yay! We found one that matches!
            continue 2;
        }
    }
    // if we come here, we have a syntax error and handle it somehow;
    // stopping is only a placeholder, but it avoids looping forever
    break;
}

// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
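
For comparison, here is a minimal sketch of a common variation on the loop above: all patterns are joined into one regexp with named groups, and the tokenizer advances an offset instead of cutting the string with substr() after every match. The patterns are simplified stand-ins, the names build_pattern() and tokenize() are hypothetical, and no timings were measured for it, so whether it is fast enough would have to be tested against real stylesheets.

```php
<?php
// Simplified placeholder patterns (assumption), tried in this order.
$TOKENS = array(
    'S'         => '[ \t\r\n\f]+',
    'ATKEYWORD' => '@[a-zA-Z_-][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'STRING'    => '"[^"]*"|\'[^\']*\'',
    'IDENT'     => '[a-zA-Z_-][a-zA-Z0-9_-]*',
    'DELIM'     => '.',
);

// Join all token patterns into one alternation of named groups.
function build_pattern(array $tokens)
{
    $parts = array();
    foreach ($tokens as $type => $regexp) {
        $parts[] = '(?<' . $type . '>' . $regexp . ')';
    }
    // 'A' anchors the match at the offset given to preg_match(),
    // 's' lets '.' match newlines, 'S' and 'u' are kept from the original.
    return '&' . implode('|', $parts) . '&ASsu';
}

function tokenize($css, array $tokens)
{
    $pattern = build_pattern($tokens);
    $stream = array();
    $offset = 0;
    $length = strlen($css);
    while ($offset < $length) {
        // One preg_match() call per token, starting at the byte offset.
        if (!preg_match($pattern, $css, $m, 0, $offset)) {
            throw new RuntimeException('Syntax error at byte ' . $offset);
        }
        // The first non-empty named group is the token type that matched.
        foreach ($tokens as $type => $unused) {
            if (isset($m[$type]) && $m[$type] !== '') {
                if ($type !== 'S') { // whitespace is dropped in this sketch
                    $stream[] = array($type, $m[$type]);
                }
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $stream;
}
```

Calling tokenize($string, $TOKENS) returns the same shape as $stream above: a list of arrays holding the token type and the token content. The sketch relies on every pattern matching at least one character, otherwise the offset would never advance.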
