PCRE multi-line match problem

Posted by Simone Margaritelli on Stack Overflow
Published on 2010-04-18T22:03:50Z Indexed on 2010/04/18 22:23 UTC

Hi guys, I have this C++ program (actually it's just a snippet):

#include <iostream>
#include <pcre.h>
#include <string>

using namespace std;

int main(){    
    string pattern = "<a\\s+href\\s*=\\s*\"([^\"]+)\"",
           html = "<html>\n"
                  "<body>\n"
                  "<a href=\"example_link_1\"/>\n"
                  "<a href=\"example_link_2\"/>\n"
                  "<a href=\"example_link_3\"/>\n"
                  "</body>\n"
                  "</html>";
    int            i, ccount, rc,
                *offsets,
                 eoffset;
    const char  *error;
    pcre         *compiled;

    compiled = pcre_compile( pattern.c_str(), PCRE_CASELESS | PCRE_MULTILINE, &error, &eoffset, 0 );
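    if( !compiled ){
        // Report what pcre_compile complained about and where, then exit with a failure code.
        cerr << "Error compiling the regexp at offset " << eoffset << ": " << error << endl;
        return 1;
    }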

    rc = pcre_fullinfo( compiled, 0, PCRE_INFO_CAPTURECOUNT, &ccount );

    offsets = new int[ 3 * (ccount + 1) ];

    rc = pcre_exec( compiled, 0, html.c_str(), html.length(), 0, 0, offsets, 3 * (ccount + 1) );

    if( rc >= 0 ){
        for( i = 1; i < rc; ++i ){
            cout << "Match : " << html.substr( offsets[2*i], offsets[2*i+1] - offsets[2*i] ) << endl;
        }
    }
    else{
        cout << "Sorry, no matches!" << endl;
    }

    delete [] offsets;

    return 0;
}

As you can see, I'm trying to match HTML links inside a buffer with the given regular expression (the \\s is \s escaped for C/C++ strings). But even though the buffer contains 3 links and the regexp is compiled with the PCRE_CASELESS and PCRE_MULTILINE flags, I match only one element:

Match : example_link_1
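
As an aside on the escaping mentioned above, here is a small sketch showing that the doubled backslashes are only a C++ string-literal matter. It assumes a C++11 compiler (for raw string literals); the names escaped_pattern, raw_pattern and patterns_are_identical are made up for this illustration.

```cpp
#include <string>

// The pattern exactly as written in the question, with C++ escapes:
// "\\s" yields the two characters \s, and "\"" yields a double quote.
const std::string escaped_pattern = "<a\\s+href\\s*=\\s*\"([^\"]+)\"";

// The same pattern as a C++11 raw string literal. Nothing is escaped here;
// the custom delimiter "re" lets the pattern contain )" safely.
const std::string raw_pattern = R"re(<a\s+href\s*=\s*"([^"]+)")re";

// Both spellings produce the same bytes, so PCRE sees the same regexp.
bool patterns_are_identical() {
    return escaped_pattern == raw_pattern;
}
```

So the regexp that reaches pcre_compile is the same one that could be pasted unescaped into a tool that reads patterns as plain text.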

Note: I start the loop from index 1 because the PCRE library stores the whole matched text (not the captured group) in the first pair of offsets, and the captured substrings follow in the next pairs.
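
The offset layout described in the note can be sketched as a small helper, independent of PCRE itself. This is only an illustration: the function name substrings_from_ovector is hypothetical, and it assumes the usual PCRE convention that pair i of the offsets array holds the start and end of group i, with pair 0 covering the whole match and -1 marking a group that did not take part in the match.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Convert the offset pairs filled in by one successful match into strings.
// Result index 0 is the whole matched text; index i >= 1 is capture group i.
// `rc` is the number of pairs that were set (the return value of the match).
std::vector<std::string> substrings_from_ovector(const std::string& subject,
                                                 const int* ovector, int rc) {
    assert(rc >= 0);  // negative values mean "no match" or an error
    std::vector<std::string> out;
    for (int i = 0; i < rc; ++i) {
        const int begin = ovector[2 * i];
        const int end = ovector[2 * i + 1];
        // An unset group is reported as -1; keep an empty placeholder for it.
        out.push_back(begin < 0 ? std::string()
                                : subject.substr(begin, end - begin));
    }
    return out;
}
```

With the program above, element 0 would be the text from <a up to the closing quote, and element 1 would be the link itself, which is why the printing loop starts at 1.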

What's wrong with this code? I think the regexp itself is correct (I tried it in PHP, for instance).
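
One likely explanation, going by how the PCRE documentation describes these calls: pcre_exec reports at most one match per call, and PCRE_MULTILINE only changes where ^ and $ may match, so neither flag makes a single call return all three links. The usual approach is to call the matcher again, starting at the end of the previous match. Below is a minimal sketch of that loop, written against a stand-in callback so it does not depend on the library being installed; MatchOnce and collect_group1 are hypothetical names, and the callback is assumed to behave like pcre_exec (fill the offset pairs, return the number of pairs set, or a negative value when nothing matches).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical single-match callback with pcre_exec-like semantics: search
// `subject` from `start`, fill `ovector`, return the number of pairs set.
using MatchOnce = std::function<int(const std::string& subject, int start,
                                    std::vector<int>& ovector)>;

// Collect capture group 1 of every match by restarting the search at the
// end of the previous match until the callback reports no match.
std::vector<std::string> collect_group1(const std::string& subject,
                                        const MatchOnce& match_once,
                                        int capture_count) {
    assert(capture_count >= 1);  // group 1 must exist in the pattern
    std::vector<int> ovector(3 * (capture_count + 1));
    std::vector<std::string> results;
    int start = 0;
    while (start <= static_cast<int>(subject.size())) {
        const int rc = match_once(subject, start, ovector);
        if (rc < 0) {
            break;  // no further match (or an error)
        }
        if (rc >= 2 && ovector[2] >= 0) {
            results.push_back(subject.substr(ovector[2], ovector[3] - ovector[2]));
        }
        // Continue after this match; step one extra character after an
        // empty match so the loop cannot stall at the same position.
        start = (ovector[1] > ovector[0]) ? ovector[1] : ovector[1] + 1;
    }
    return results;
}
```

With the program above, the callback could be a thin wrapper that returns pcre_exec( compiled, 0, subject.c_str(), subject.length(), start, 0, ovector.data(), ovector.size() ), since the fifth argument of pcre_exec is the offset at which the search begins. Under that assumption the loop should yield all three example links instead of only the first.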

© Stack Overflow or respective owner
