Regex matching very slow

Posted by Ali Lown on Stack Overflow
Published on 2010-04-01T20:08:58Z Indexed on 2010/04/01 20:13 UTC

I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled decompressing (inflating) the deflate-encoded streams to get the content into its alphanumeric form. I now need to extract the text from the text blocks.
So, my current pattern is "BT.*?\((.*?)\).*?ET" (with DOTMATCHALL set) to match something like:

BT
   /F13 12 Tf
   288 720 Td
   (ABC) Tj
ET

The only bit I want is the text ABC in the brackets.
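
For reference, here is a minimal reproduction of the pattern applied to the sample block. It is only a sketch: the question does not name the regex library, so Python's `re` module is assumed, `re.DOTALL` is assumed to be the equivalent of DOTMATCHALL, and the literal parentheses are written escaped as `\(` and `\)`, which is assumed to be the intent. The names `PATTERN` and `extract_text` are made up for illustration.

```python
import re

# Assumption: Python's re module; re.DOTALL stands in for "DOTMATCHALL".
# The parentheses around the text are escaped so they match literally,
# leaving one capturing group for the text itself.
PATTERN = re.compile(r"BT.*?\((.*?)\).*?ET", re.DOTALL)

# The sample text block from the question.
sample = "BT\n   /F13 12 Tf\n   288 720 Td\n   (ABC) Tj\nET\n"


def extract_text(stream):
    """Return the captured text of every match of PATTERN in stream."""
    return PATTERN.findall(stream)


print(extract_text(sample))  # ['ABC']
```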

The above pattern works, but it is really slow. I assume this is because the regex engine backtracks heavily: the lazy ".*?" between BT and the opening bracket of (ABC) has to be extended one character at a time, and each attempt can fail many times before a match is found.
The regex is pre-compiled in an attempt to speed it up, but the improvement seems negligible.

How may I speed this up?
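
One possible direction, offered as a sketch rather than a definitive fix, is to split the work into two bounded steps: first isolate each BT ... ET block, then look for bracketed strings only inside that block. This stops a lazy `.*?` from scanning past an ET in search of a bracket when a block contains no text. Python's `re` module is again assumed, and `BLOCK`, `STRING`, and `extract_text_blocks` are hypothetical names. The sketch deliberately ignores escaped or nested brackets inside PDF strings and assumes the letters BT and ET do not appear as standalone words inside the text.

```python
import re

# Step 1: one text block, from the BT operator to the nearest ET operator.
# \b keeps BT/ET from matching inside longer alphanumeric tokens.
BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)

# Step 2: a bracketed string with no brackets inside it. The negated
# character class cannot run past the closing bracket, so there is
# little for the engine to backtrack over.
STRING = re.compile(r"\(([^()]*)\)")


def extract_text_blocks(stream):
    """Collect every bracketed string found inside each BT ... ET block."""
    results = []
    for block in BLOCK.finditer(stream):
        results.extend(STRING.findall(block.group(1)))
    return results


# Hypothetical input: the first block has no text, the second has two strings.
demo = "BT\n /F1 12 Tf\nET\nBT\n (ABC) Tj\n (DEF) Tj\nET\n"
print(extract_text_blocks(demo))  # ['ABC', 'DEF']
```

Whether this is actually faster depends on the regex engine and the input, so it would need to be timed against the real decompressed streams.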

