How can I use R (RCurl/XML packages?) to scrape this webpage?
Posted by Tal Galili on Stack Overflow
Published on 2010-03-14T18:03:23Z
Indexed on 2010/03/14 18:05 UTC
Hi all,
I have a (somewhat complex) web-scraping challenge that I wish to accomplish, and I would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
So for each of them I will go to:
- The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
- And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
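To make the navigation step concrete, here is a minimal sketch in R. It assumes every species follows the same URL pattern as the Aero_pern example, which I have not verified. The helper names (`structs_url`, `page_links`, `page_lines`) are made up for illustration; the only package calls used are `getURL` from RCurl and `htmlParse`, `xpathSApply` and `xmlValue` from XML.

```r
# Build the "Secondary Structures" URL from a species page URL.
# Assumption: every species follows the Aero_pern naming pattern.
structs_url <- function(species_url) {
  species_url <- sub("/+$", "", species_url)   # drop trailing slash(es)
  species_id  <- basename(species_url)         # e.g. "Aero_pern"
  paste0(species_url, "/", species_id, "-structs.html")
}

# Collect all link targets (href attributes) found on a page.
# Needs the RCurl and XML packages to be installed.
page_links <- function(url) {
  html <- RCurl::getURL(url)
  doc  <- XML::htmlParse(html, asText = TRUE)
  unname(XML::xpathSApply(doc, "//a/@href"))
}

# Download a page and return the text of its body, one element per line.
page_lines <- function(url) {
  html <- RCurl::getURL(url)
  doc  <- XML::htmlParse(html, asText = TRUE)
  text <- as.character(XML::xpathSApply(doc, "//body", XML::xmlValue))
  unlist(strsplit(text, "\r?\n"))
}

# Example of the URL pattern (no network access needed for this part):
structs_url("http://gtrnadb.ucsc.edu/Aero_pern/")
```

The idea would be to call `page_links()` on the index page to find the species pages, then `page_lines(structs_url(...))` for each of them. Which links on the index page are actually species pages would still need to be filtered by hand or by pattern.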
On the "Secondary Structures" page I wish to scrape the data so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Where each line will have its own list (inside the list for each "trna", which is inside the list for each animal).
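The nested structure I have in mind could look roughly like the sketch below, written in base R only. It assumes the page text has already been read into a character vector with one element per line, and that every record starts with a header shaped like the `chr.trna3 (1-77)` line above. Both are assumptions about the real pages, and the function names `split_records` and `parse_trna_record` are hypothetical.

```r
# Split page lines into records; a record is assumed to start at a header
# line that looks like "<name>.trna<N> (<start>-<end>) ...".
split_records <- function(lines) {
  lines  <- trimws(lines)
  lines  <- lines[nzchar(lines)]                  # drop empty lines
  starts <- grepl("\\.trna[0-9]+ \\(", lines)     # TRUE on header lines
  groups <- cumsum(starts)                        # record index per line
  keep   <- groups > 0                            # skip text before 1st record
  unname(split(lines[keep], groups[keep]))
}

# Turn the lines of one record into a named list of fields.
parse_trna_record <- function(lines) {
  header    <- lines[1]
  type_line <- lines[grepl("^Type:", lines)][1]
  seq_line  <- lines[grepl("^Seq:", lines)][1]
  str_line  <- lines[grepl("^Str:", lines)][1]
  list(
    name      = sub("\\s.*$", "", header),
    length_bp = as.integer(sub(".*Length:\\s*([0-9]+)\\s*bp.*", "\\1", header)),
    type      = sub("^Type:\\s*(\\S+).*", "\\1", type_line),
    anticodon = sub(".*Anticodon:\\s*(\\S+).*", "\\1", type_line),
    score     = as.numeric(sub(".*Score:\\s*([-0-9.]+).*", "\\1", type_line)),
    seq       = sub("^Seq:\\s*", "", seq_line),
    str       = sub("^Str:\\s*", "", str_line)
  )
}

# The example record from this question, as it might arrive line by line.
example <- c(
  "chr.trna3 (1-77) Length: 77 bp",
  "Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# One list per tRNA, named by tRNA, inside one list per species.
records <- lapply(split_records(example), parse_trna_record)
names(records) <- vapply(records, function(r) r$name, character(1))
species <- list(Aero_pern = records)
```

With this layout a single field would be reached as `species$Aero_pern[["chr.trna3"]]$anticodon`. The regular expressions only cover the fields visible in the example record, so other record layouts would need extra handling.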
I remember coming across the packages RCurl and XML (in R), which can allow for such a task, but I don't know how to use them. So what I would love to have is: 1. Some suggestions on how to build such code. 2. Recommendations for how to learn what is needed to perform such a task.
Thanks for any help,
Tal
© Stack Overflow or respective owner