How to isolate a single element from a scraped web page in R

Posted by PaulHurleyuk on Stack Overflow, 2010-06-08

Hello,

I'm trying to do someone a favour, and it's a tad outside my comfort zone, so I'm stuck.

I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html) and others, to get the goal scorers and times.

So far, this is what I've got

require(RCurl)
require(XML)

theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"

# fetch the raw HTML as one long character string
webpage <- getURL(theURL, header = FALSE, verbose = TRUE)

# split it into lines for the parser
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)

# parse into an internal document so XPath functions can query it
pagetree <- htmlTreeParse(webpagecont, error = function(...){}, useInternalNodes = TRUE)

and the pagetree object now contains a pointer to my parsed HTML (I think). The part I want is:

<div class="cont")<ul>
<div class="bold medium">Goals scored</div>
        <li>Philipp LAHM (GER) 6', </li>
        <li>Paulo WANCHOPE (CRC) 12', </li>
        <li>Miroslav KLOSE (GER) 17', </li>
        <li>Miroslav KLOSE (GER) 61', </li>
        <li>Paulo WANCHOPE (CRC) 73', </li>
        <li>Torsten FRINGS (GER) 87'</li>
</ul></div>

but I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the beejeebies out of me!

So, does anyone know how to formulate a command to suck out the elements contained within the <li> tags?
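For what it's worth, this is the direction I've been poking at; the XPath expression and the regular expression are both guesses on my part, so they may well be wrong:

# guess: select every <li> inside the <div class="cont"> and return its text
goals <- xpathSApply(pagetree, "//div[@class='cont']//li", xmlValue)

# guess: strip everything but the digits to leave the minute of each goal
minutes <- gsub("[^0-9]", "", goals)

If that XPath is anywhere near right, goals should come back as a character vector like "Philipp LAHM (GER) 6'," and so on.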

Thanks

Paul.
