Regex matching very slow

Posted by Ali Lown on Stack Overflow
Published on 2010-04-01T20:08:58Z Indexed on 2010/04/01 20:13 UTC

Filed under: regex | slow | pdf

I am trying to parse a PDF to extract its text (please don't suggest any libraries for this, as it is part of learning the format).
I have already handled inflating the compressed streams to get them into readable text form. I now need to extract the text from the text blocks.
So, my current pattern is "BT.*?\((.*?)\).*?ET" (with DOTMATCHALL set) to match something like:

BT
   /F13 12 Tf
   288 720 Td
   (ABC) Tj
ET

The only bit I want is the text ABC in the brackets.
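
For reference, the match described above can be reproduced as follows. This is a minimal sketch that assumes Python's `re` module (where the dot-matches-all flag is spelled `re.DOTALL`) and a pattern with the literal parentheses escaped; the names `TEXT_BLOCK`, `sample` and `extract_text` are only illustrative.

```python
import re

# Assumed form of the pattern: the parentheses around the text are escaped
# so they match literally, and re.DOTALL lets "." cross line breaks.
TEXT_BLOCK = re.compile(r"BT.*?\((.*?)\).*?ET", re.DOTALL)

# The text block from the example above.
sample = """BT
   /F13 12 Tf
   288 720 Td
   (ABC) Tj
ET
"""

def extract_text(data):
    """Return the first parenthesised string of every BT ... ET block."""
    return TEXT_BLOCK.findall(data)

print(extract_text(sample))  # prints ['ABC']
```

With a single capturing group, `findall` returns just the captured strings, so only the text inside the brackets comes back.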

The above pattern works, but it is really slow. I assume this is because the regex library repeatedly fails, and has to retry, while matching the part of the pattern that covers the text between BT and the (ABC).
The regex is pre-compiled in an attempt to speed it up, but the improvement seems negligible.

How may I speed this up?
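
One direction that may be worth measuring, offered as a sketch rather than a definitive fix, is to split the work into two passes so that each quantifier only has to scan a single block. The example assumes Python's `re` module. `BLOCK`, `STRING` and `extract_strings` are hypothetical names. It deliberately ignores escaped or nested parentheses, and a BT or ET that appears inside a string, all of which a real parser would have to handle.

```python
import re

# Pass 1: isolate each BT ... ET block. The \b anchors keep "BT" and "ET"
# from matching in the middle of longer tokens.
BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)

# Pass 2: a negated character class stops at the first closing parenthesis,
# so this part needs no lazy quantifier and cannot run past the string.
STRING = re.compile(r"\(([^()]*)\)")

def extract_strings(data):
    """Collect every simple (...) string found inside BT ... ET blocks."""
    results = []
    for block in BLOCK.finditer(data):
        results.extend(STRING.findall(block.group(1)))
    return results

if __name__ == "__main__":
    demo = "BT\n/F13 12 Tf\n(ABC) Tj\nET\nBT\n(DEF) Tj\n(GHI) Tj\nET\n"
    print(extract_strings(demo))  # prints ['ABC', 'DEF', 'GHI']
```

Unlike the single pattern, this returns every string in a block rather than only the first one. Whether it is actually faster depends on the input and the regex engine, so it should be timed against real data.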

© Stack Overflow or respective owner
