Regex matching very slow
- by Ali Lown
I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled deflating it to put it in the alphanumeric format. I now need to extract the text from the text blocks.
So, my current pattern is "BT.*?((.*?)).*?ET" (with DOTMATCHALL set) to match something like:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
The only bit I want is the text ABC in the brackets.
The above pattern works, but is really slow, I assume it is because the regex library is failing to match the pattern that matches the text between BT and the (ABC) many times.
The regex is pre-compiled in an attempt to speed it up, but it seems negligible.
How may I speed this up?