How can I use R (RCurl/XML packages?) to scrape this webpage?

Posted by Tal Galili on Stack Overflow See other posts from Stack Overflow or by Tal Galili
Published on 2010-03-14T18:03:23Z Indexed on 2010/03/14 18:05 UTC

Hi all,

I have a (somewhat complex) web scraping challenge that I would like to accomplish, and I would love some direction (to whatever level you feel like sharing). Here goes:

I would like to go through all the "species pages" present in this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

  1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
  2. And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
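
To make the two navigation steps above concrete, here is a minimal base-R sketch. It assumes every species follows the `<name>/<name>-structs.html` pattern seen in the Aero_pern example, which I have not verified for the whole site. `structs_url`, `extract_hrefs`, and the `example_html` snippet are hypothetical names and data made up for illustration; the real index page would be downloaded first (for instance with RCurl) and is likely better handled with a proper HTML parser such as the XML package.

```r
# Derive the "Secondary Structures" URL from a species page URL.
# ASSUMPTION: all species use the <name>/<name>-structs.html layout.
structs_url <- function(species_url) {
  species_url <- sub("/+$", "", species_url)  # drop trailing slash(es)
  species <- basename(species_url)            # e.g. "Aero_pern"
  paste0(species_url, "/", species, "-structs.html")
}

# Hypothetical helper: pull href values out of raw HTML text with a regex.
# This is a rough stand-in for a real HTML parser.
extract_hrefs <- function(html) {
  m <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  sub('^href="', "", sub('"$', "", m))
}

# Made-up stand-in for the index page (not the real markup of the site).
example_html <- '<a href="Aero_pern/">Aeropyrum pernix</a>'

species_dirs <- extract_hrefs(example_html)
urls <- structs_url(paste0("http://gtrnadb.ucsc.edu/", species_dirs))
print(urls)
```

Looping over `urls` and downloading each page would then give the raw text to parse.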

On that page I wish to scrape the data so that I end up with a long list containing records like this (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Each line will have its own list, nested inside the list for each "trna", which is in turn nested inside the list for each animal.
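
As a rough illustration of the nested structure I have in mind, the sketch below parses the example record with base-R regular expressions. It assumes each record is exactly four consecutive non-empty lines and that the first line of a record contains `Length:`; I have not checked this against every page. `parse_trna_record`, `parse_structs_text`, and the field names are hypothetical, chosen only for this example.

```r
# Parse one four-line tRNA record into a named list.
parse_trna_record <- function(lines) {
  stopifnot(length(lines) == 4)
  grab <- function(pattern, x) sub(pattern, "\\1", x, perl = TRUE)
  list(
    name      = grab("^(\\S+).*$", lines[1]),
    length_bp = as.integer(grab("^.*Length:\\s*(\\d+).*$", lines[1])),
    type      = grab("^Type:\\s*(\\S+).*$", lines[2]),
    anticodon = grab("^.*Anticodon:\\s*(\\S+).*$", lines[2]),
    score     = as.numeric(grab("^.*Score:\\s*([0-9.]+).*$", lines[2])),
    seq       = sub("^Seq:\\s*", "", lines[3], perl = TRUE),
    str       = sub("^Str:\\s*", "", lines[4], perl = TRUE)
  )
}

# Split the text of one page into records, keyed by tRNA name.
# ASSUMPTION: a record starts at each line containing "Length:".
parse_structs_text <- function(text_lines) {
  text_lines <- text_lines[nzchar(trimws(text_lines))]
  starts <- grep("Length:", text_lines, fixed = TRUE)
  starts <- starts[starts + 3 <= length(text_lines)]
  records <- lapply(starts, function(i) parse_trna_record(text_lines[i:(i + 3)]))
  names(records) <- vapply(records, function(r) r$name, character(1))
  records
}

# The example record from the question, used as offline test input.
sample_lines <- c(
  "chr.trna3 (1-77)    Length: 77 bp",
  "Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# Outer list keyed by species, inner list keyed by tRNA name.
trna_data <- list(Aero_pern = parse_structs_text(sample_lines))
str(trna_data$Aero_pern[["chr.trna3"]])
```

With this layout, a single field would be reached as `trna_data$Aero_pern[["chr.trna3"]]$anticodon`.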

I remember coming across the packages RCurl and XML (in R), which can allow for such a task, but I don't know how to use them. So what I would love to have is:

  1. Some suggestions on how to build such code.
  2. Recommendations on how to learn what is needed to perform such a task.

Thanks for any help,

Tal

© Stack Overflow or respective owner
