How to parse a string (by a "new" markup) with R?

Posted by Tal Galili on Stack Overflow
Published on 2010-03-16T10:13:00Z Indexed on 2010/03/16 10:16 UTC

Hi all,

I want to use R to do string parsing that (I think) resembles a simplified form of HTML parsing.

For example, let's say we have the following two variables:

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

Say that I want to parse "Seq" according to "Str", using the legend below:

Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        Stem 1            Stem 2                 Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                Stem 0

Assume that we always have 4 stems (0 to 3), but that the number of letters before and after each of them can vary.
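To make the legend concrete, here is a minimal sketch (not a full parser) that uses base R's rle() to split "Str" into runs of identical symbols and cuts "Seq" at the same positions. The name `segments` and its column names are only illustrative.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# The markup only makes sense if both strings are aligned character by character.
stopifnot(nchar(Seq) == nchar(Str))

# Run-length encode the markup: one run per block of identical symbols.
runs   <- rle(strsplit(Str, "")[[1]])
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Cut Seq at the same positions (substring() is vectorised over start/end).
segments <- data.frame(
  symbol = runs$values,
  start  = starts,
  end    = ends,
  piece  = substring(Seq, starts, ends),
  stringsAsFactors = FALSE
)
print(segments)
```

One thing this shows: the last run of "<" is 12 characters long, because the closing of Stem 3 runs straight into the closing of Stem 0. So splitting on runs alone is not enough, and the length of each opening has to be used to decide where its closing ends.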

The output should be something like the following list structure:

list(
    "Stem 0 opening" = "GCCTCGA",
    "before Stem 1" = "TA",
    "Stem 1" = list(opening = "GCTC",
                inside = "AGTTGGGA",
                closing = "GAGC"
            ),
    "between Stem 1 and 2" = "G",
    "Stem 2" = list(opening = "TACGA",
                inside = "CTGAAGA",
                closing = "TCGTA"
            ),
    "between Stem 2 and 3" = "AGGtC",
    "Stem 3" = list(opening = "ACCAG",
                inside = "TTCGATC",
                closing = "CTGGT"
            ),
    "After Stem 3" = "",
    "Stem 0 closing" = "TCGGGGC"
)

I don't have any experience with programming a parser, and would like advice on what strategy to use when programming something like this (and on any recommended R commands to use).

What I was thinking of is to first get rid of "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time splits the string into:

1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem

The "after stem" part is then passed recursively into the same function ("seperate.stem").
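The strategy described above could be sketched roughly as follows. This is only a sketch under some assumptions that are not guaranteed by the data: Stem 0 is the outermost pair, its closing is the last block of "<" before any trailing dots and has the same length as its opening, and Stems 1 to 3 are not nested (only dots between their opening and closing). The wrapper name `parse.stems` and the argument names are hypothetical.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Recursive step: peel one stem off the front, then recurse on what is left.
seperate.stem <- function(seq.part, str.part, i = 1) {
  # Base case: no opening bracket left, so the rest is the "after" piece.
  if (!grepl(">", str.part, fixed = TRUE)) {
    return(setNames(list(seq.part), sprintf("After Stem %d", i - 1)))
  }
  # Five groups: before, opening, inside, closing, rest (assumes no nesting).
  m <- regmatches(str.part,
                  regexec("^(\\.*)(>+)(\\.*)(<+)(.*)$", str.part))[[1]]
  if (length(m) == 0) stop("unexpected structure: ", str.part)

  # Group widths in the markup give the cut points in the sequence.
  w      <- nchar(m[-1])
  ends   <- cumsum(w)
  starts <- ends - w + 1
  p      <- substring(seq.part, starts, ends)

  before <- if (i == 1) "before Stem 1" else
    sprintf("between Stem %d and %d", i - 1, i)
  head <- setNames(
    list(p[1], list(opening = p[2], inside = p[3], closing = p[4])),
    c(before, sprintf("Stem %d", i))
  )
  c(head, seperate.stem(p[5], m[6], i + 1))
}

# Hypothetical wrapper: strip Stem 0 first, then parse the inner part.
parse.stems <- function(Seq, Str) {
  n0   <- attr(regexpr("^>+", Str), "match.length")  # length of Stem 0 opening
  last <- nchar(sub("\\.+$", "", Str))                # position of the last "<"
  c(list("Stem 0 opening" = substr(Seq, 1, n0)),
    seperate.stem(substr(Seq, n0 + 1, last - n0),
                  substr(Str, n0 + 1, last - n0)),
    list("Stem 0 closing" = substr(Seq, last - n0 + 1, last)))
}

result <- parse.stems(Seq, Str)
str(result)
```

The recursion takes the place of an explicit loop: each call handles exactly one stem and hands the remainder to the next call, and the counter `i` is only there to build the element names.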

The thing is, I am not sure how to code this without using a loop.
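For what it's worth, one possible way to avoid both a loop and recursion is to let a single vectorised regular-expression call find all the inner stems at once. The sketch below assumes, as before, that Stem 0 is the outermost pair with equally long opening and closing, and that the remaining stems are not nested; the variable names are illustrative only.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Strip Stem 0 so that the remaining stems sit side by side.
n0        <- attr(regexpr("^>+", Str), "match.length")
last      <- nchar(sub("\\.+$", "", Str))
inner.seq <- substr(Seq, n0 + 1, last - n0)
inner.str <- substr(Str, n0 + 1, last - n0)

# One call finds every stem: an opening run, optional dots, a closing run.
hits   <- gregexpr(">+\\.*<+", inner.str)[[1]]
starts <- as.integer(hits)
ends   <- starts + attr(hits, "match.length") - 1

# Stems and the gaps around them, cut from the sequence in one go each.
stems <- substring(inner.seq, starts, ends)
gaps  <- substring(inner.seq, c(1, ends + 1), c(starts - 1, nchar(inner.seq)))

stems  # "GCTCAGTTGGGAGAGC" "TACGACTGAAGATCGTA" "ACCAGTTCGATCCTGGT"
gaps   # "TA" "G" "AGGtC" ""
```

Each element of `stems` would still need to be split into opening, inside and closing, which can be done the same way by applying the pattern pieces to the matching stretch of "Str".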

Any advice would be most welcome.

