Need help with regex parsing (in Perl)

Posted by Charlie on Stack Overflow
Published on 2010-05-02T03:14:01Z

Hi all, I need some help parsing an HTML file in Perl.

I used the LWP module to retrieve a web page into $_, with $/ undefined so there are no newline issues. Now I'm trying to find all strings matching a pattern. I know how to find one instance, but how do I match all instances? And what data structure should the results go into? A multidimensional array?
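For reference, a minimal sketch of the general technique being asked about: the /g modifier makes a match operator find every occurrence instead of only the first. The sample string and the variable names ($text, @pids, @pairs) are made up for illustration and are not from my real script.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the fetched page content.
my $text = 'pid=1233 ... pid=1234 ... pid=1235';

# In list context, m//g returns the captured group of every match at once.
my @pids = $text =~ /pid=(\d+)/g;

# In scalar context, m//g advances one match per iteration, so $1 and $2
# can be stored together for each occurrence.
my @pairs;
while ($text =~ /(pid)=(\d+)/g) {
    push @pairs, [$1, $2];    # one array reference per match
}

print "Found pids: @pids\n";
print "Number of pairs: ", scalar(@pairs), "\n";
```

The second form yields an array of array references, which is the usual way to represent a multidimensional array in Perl.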

My text (excerpt) looks like the following:

<TR> 
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1233"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 1</A></FONT></TD> 
 <TD BGCOLOR=EEEEEE nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jun 27 2010  3:00PM</FONT></TD> 
 <TD BGCOLOR=EEEEEE>&nbsp;</TD> 
</TR> 
<TR><TD BGCOLOR=EEEEEE COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR><TD COLSPAN=3 BGCOLOR=999999><IMG SRC="http://images.domain.com/images/spacer.gif" HEIGHT=1 WIDTH=1></TD></TR> 
<TR><TD COLSPAN=3 ><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR> 
 <TD><A HREF="/program.cgi?pid=1234"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 2</A></FONT></TD> 
 <TD nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jun 29 2010  7:00PM</FONT></TD> 
 <TD>&nbsp;</TD> 
</TR> 
<TR><TD COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR><TD COLSPAN=3 BGCOLOR=999999><IMG SRC="http://images.domain.com/images/spacer.gif" HEIGHT=1 WIDTH=1></TD></TR> 
<TR><TD COLSPAN=3  BGCOLOR=EEEEEE><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR> 
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1235"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 3</A></FONT></TD> 
 <TD BGCOLOR=EEEEEE nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jul  3 2010  7:00PM</FONT></TD> 
 <TD BGCOLOR=EEEEEE>&nbsp;</TD> 
</TR> 

I want to get the following into an array (or any structure):

{ ["/program.cgi?pid=1233", "Title 1"], ["/program.cgi?pid=1234", "Title 2"], ["/program.cgi?pid=1235", "Title 3"] }

Thanks
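
One possible way to build that structure from the excerpt, offered as a sketch rather than a definitive solution. It assumes every link of interest has the form <A HREF="..."><FONT ...>title</A>, as in the rows shown; the names $html and @programs are illustrative, and the heredoc holds two abridged rows instead of the real page in $_.

```perl
use strict;
use warnings;

# Two rows abridged from the excerpt; in the real script this would be the
# page content retrieved with LWP.
my $html = <<'HTML';
<TR>
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1233"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 1</A></FONT></TD>
</TR>
<TR><TD COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR>
<TR>
 <TD><A HREF="/program.cgi?pid=1234"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 2</A></FONT></TD>
</TR>
HTML

# Collect [href, title] pairs: capture the quoted HREF value, skip the FONT
# tag, then capture the text up to the closing </A>.
my @programs;
while ($html =~ m{<A\s+HREF="([^"]+)">\s*<FONT[^>]*>([^<]*)</A>}gi) {
    push @programs, [$1, $2];
}

# Print each pair to show how the nested structure is accessed.
printf "%s => %s\n", @$_ for @programs;
```

Each element of @programs is an array reference, so $programs[0][0] is the first link and $programs[0][1] is its title. A regex like this depends on the markup staying exactly as shown; a real HTML parser such as HTML::TreeBuilder would likely cope better with variations.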

© Stack Overflow or respective owner