Regex matching very slow
Posted by Ali Lown on Stack Overflow
Published on 2010-04-01T20:08:58Z
I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already decompressed the deflated streams, so the content is in plain text form. I now need to extract the text from the text blocks.
So, my current pattern is "BT.*?\((.*?)\).*?ET" (with DOTMATCHALL set) to match something like:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
The only bit I want is the text ABC in the brackets.
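For reference, a minimal sketch of the setup described above. It assumes Python's re module (the post does not name a language) and assumes the parentheses in the pattern are escaped so they match literally. The names PATTERN, SAMPLE and extract_text are made up for illustration.

```python
import re

# Assumption: the literal parentheses are escaped, and "DOTMATCHALL"
# corresponds to re.DOTALL in Python.
PATTERN = re.compile(r"BT.*?\((.*?)\).*?ET", re.DOTALL)

# The sample text block from the post.
SAMPLE = "BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET\n"


def extract_text(content):
    """Return the first parenthesised string found after each BT."""
    # With a single capture group, findall returns the group contents.
    return PATTERN.findall(content)


print(extract_text(SAMPLE))  # ['ABC']
```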
The above pattern works, but it is really slow. I assume this is because the regex library repeatedly fails to match the part of the pattern that covers the text between BT and (ABC).
The regex is pre-compiled in an attempt to speed it up, but the improvement seems negligible.
How may I speed this up?
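One possible direction, sketched here under the same assumption of Python's re module: when a BT block contains no string, the lazy `.*?` may keep scanning past that block's ET looking for a "(", so a lot of input can be rescanned from each BT. Splitting the work into two bounded steps avoids that: first isolate each BT ... ET block, then search for strings only inside it. The names BLOCK, STRING and extract_strings are hypothetical, and this is not a full PDF string parser (nested unescaped parentheses and a BT or ET occurring inside a string are not handled).

```python
import re

# Step 1: isolate each text block. The word boundaries keep "BT"/"ET"
# from matching inside longer tokens.
BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)

# Step 2: a parenthesised string made of escaped characters or
# characters that are not a backslash or a parenthesis. The two
# alternatives cannot match the same character, which limits backtracking.
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def extract_strings(content):
    """Return every parenthesised string found inside BT ... ET blocks."""
    results = []
    for block in BLOCK.finditer(content):
        # The search is confined to this block's body.
        results.extend(STRING.findall(block.group(1)))
    return results


print(extract_strings("BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET\n"))  # ['ABC']
```

Whether this is actually faster on a given file would need to be measured. The point of the sketch is only that neither pattern can scan past the end of the block it started in.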