How do I extract HTML content using Regex in PHP

Posted by gAMBOOKa on Stack Overflow See other posts from Stack Overflow or by gAMBOOKa
Published on 2010-05-12T12:06:50Z Indexed on 2010/05/12 12:14 UTC
Read the original article Hit count: 546

Filed under:

php

|

regex

|

html

|

extract

|

html-content-extraction

I know, i know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites. You'd need a separate function for each website.

In any case, I don't know much about regexes, so bear with me.

I've got an HTML page in a format similar to this

<html>
<head>...</head>
<body>
    <div class=nav>...</div><p id="someshit" />
    <div class=body>....</div>
    <div class=footer>...</div>
</body>

I need to extract the contents of the body class container.

I tried this.

$pattern = "/<div class=\"body\">\(.*?\)<\/div>/sui"
$text = $htmlPageAsIs;
if (preg_match($pattern, $text, $matches))
    echo "MATCHED!";
else
    echo "Sorry gambooka, but your text is in another castle.";

What am I doing wrong? My text ends up in another castle.

© Stack Overflow or respective owner

Related posts about php

Magento, NGINX, PHP-FPM, APC, MEMCACHED, 16gb Ram CentOS, Spiking PHP-FPM to 100% CPU

as seen on Server Fault - Search for 'Server Fault'
I have been trying to resolve my issue of spiking cpu caused by php-fpm processes. I've reduced the php-fpm config settings to: pm = ondemand pm.max_children = 12 pm.start_servers = 2 pm.min_spare_servers = 2 pm.max_spare_servers = 10 pm.max_requests = 500 php_admin_value[memory_limit] = 128M Problem… >>> More
PHP Pear Installation on CentOS

as seen on Server Fault - Search for 'Server Fault'
[root@ip ~]# yum install php-pear* Reducing CentOS-5 Testing to included packages only Finished Setting up Install Process Package 1:php-pear-1.8.1-2.el5.centos.noarch already installed and latest versio … >>> More
Apache configurations for php "AddType text/html php" or "AddType application/x-httpd-php php .php"

as seen on Server Fault - Search for 'Server Fault'
I am taking over an application server and discover that it contain the following settings: AddType text/html php Although it works, but my understanding is that it should set as following: AddType application/x-httpd-php php .php What are the key differences between the two settings?… >>> More
mod_rewrite settings causes server to throw HTTP 500 errors instead of 404

as seen on Server Fault - Search for 'Server Fault'
Hello. I have a server with VBulletin forum (working under Apache 2.2, CentOS). The default settings for it in .htaccess are as follows: RewriteEngine on RewriteCond %{HTTP_HOST} ^gsmforum\.ru RewriteRule (.*) http://www.gsmforum.ru/$1 [R=301,L] # If you are having problems or are using VirtualDocumentRoot… >>> More
Problems installing Memcache (PECL extension)

as seen on Server Fault - Search for 'Server Fault'
I have installed memcached fine, and now I will need to install PECL extension memcache. Im running RedHat x86_64 es5. The installation gives me this: downloading memcache-2.2.6.tgz ... Starting to download memcache-2.2.6.tgz (35,957 bytes) ..........done: 35,957 bytes 11 source files, building running:… >>> More

Related posts about regex

Find multiple regex in each line and skip result if one of the regex doesn't match

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a list of variables: variables = ['VariableA', 'VariableB','VariableC'] which I'm going to search for, line by line ifile = open("temp.txt",'r') d = {} match = zeros(len(variables)) for line in ifile: emptyCells=0 for i in range(len(variables)): regex = r'('+variables[i]+r')[:|=|\(](-… >>> More
OWASP Regex Repository: Is this regex correct?

as seen on Stack Overflow - Search for 'Stack Overflow'
I was looking at the regular expression for validating various data types from the (OWASP Regex Repository). One of the regular expressions in there is called safetext and looks like: ^[a-zA-Z0-9\s.\-]+$ My first question is: Is this regular expression correct? complementary question If this… >>> More
Make a Perl-style regex interpreter behave like a basic or extended regex interpreter

as seen on Stack Overflow - Search for 'Stack Overflow'
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java. The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough. But I want to support several different regex… >>> More
JS regex isn't matching, even thought it works with a regex tester

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm writing a piece of client-side javascript code that takes a function and finds the derivative of it, however, the regex that's supposed to match with the power rule fails to work in the context of the javascript program, even though it sucessfully matches when it's used with an independent regex… >>> More
c# RegEx with "|"

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to be able to check for a pattern with | in them. For example an expression like d*|*t should return true for a string like "dtest|test". I'm no regex hero so I just tried a couple of things, like: Regex Pattern = new Regex("s*\|*d"); //unable to build because of single backslash Regex Pattern… >>> More