Is there a way to optimise finding text items on a page (not regex)

Posted by Jeepstone on Stack Overflow
Published on 2010-05-06T10:25:25Z; indexed on 2010/05/06 10:28 UTC

Filed under: php | regex | matching | parsing | optimization

After seeing several threads rubbishing the regexp method of finding a term to match within an HTML document, I've used the Simple HTML DOM PHP parser (http://simplehtmldom.sourceforge.net/) to get the bits of text I'm after, but I want to know if my code is optimal. It feels like I'm looping too many times. Is there a way to optimise the following loop?

    // Get the HTML and look at the text nodes
    $html = str_get_html($buffer);
    // First we match the <body> tag, as we don't want to change the <head> items
    foreach ($html->find('body') as $body) {
        // Then we get the text nodes, rather than any HTML
        foreach ($body->find('text') as $text) {
            // Then we match each term
            foreach ($terms as $term) {
                // Wrap each occurrence of the term within the text node
                $text->outertext = str_replace($term, '<span class="highlight">'.$term.'</span>', $text->outertext);
            }
        }
    }

For example, would it make a difference to check whether there are any matches before I start the loop?
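
To make the two ideas concrete, here is a minimal sketch, not a measured answer. It assumes PHP 7 or later and uses plain strings as stand-ins for the `$text->outertext` values, so it runs without the Simple HTML DOM library. The helper names `any_term_present` and `build_highlights` are hypothetical. The first helper is a cheap pre-check on the raw buffer. The second builds the replacement strings once, so a single `str_replace` call with arrays can stand in for the innermost `foreach`. Whether either change is noticeably faster would need to be measured on real pages.

```php
<?php
// Hypothetical helper: true if at least one term occurs anywhere in the raw buffer.
// This is only a coarse filter: a hit may be inside <head> or an attribute.
function any_term_present(string $buffer, array $terms): bool
{
    foreach ($terms as $term) {
        if ($term !== '' && strpos($buffer, $term) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: build the replacement strings once, outside any node loop.
function build_highlights(array $terms): array
{
    return array_map(function ($term) {
        return '<span class="highlight">' . $term . '</span>';
    }, $terms);
}

$terms = ['apple', 'pear'];
$highlights = build_highlights($terms);

// Stand-ins for the text node contents ($text->outertext in the question).
$nodes = ['An apple and a pear.', 'No fruit here.'];

if (any_term_present(implode(' ', $nodes), $terms)) {
    foreach ($nodes as $i => $node) {
        // str_replace accepts arrays and applies the pairs in order,
        // which matches what the per-term loop does.
        $nodes[$i] = str_replace($terms, $highlights, $node);
    }
}
```

In the original loop, the same call would be `$text->outertext = str_replace($terms, $highlights, $text->outertext);`, with `$highlights` built before the outer `foreach`. When the pre-check finds nothing, the call to `str_get_html` could be skipped entirely.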

© Stack Overflow or respective owner
