trie reg exp parse step over char and continue

Posted by forest.peterson on Stack Overflow See other posts from Stack Overflow or by forest.peterson
Published on 2012-09-04T21:00:49Z Indexed on 2012/10/01 21:38 UTC
Read the original article Hit count: 237

Filed under:
|
|

Setup: 1) a string trie database formed from linked nodes and a vector array linking to the next node terminating in a leaf, 2) a recursive regular expression function that if A) char '*' continues down all paths until string length limit is reached, then continues down remaining string paths if valid, and B) char '?' continues down all paths for 1 char and then continues down remaining string paths if valid. 3) after reg expression the candidate strings are measured for edit distance against the 'try' string.

Problem: the reg expression works fine for adding chars or swapping ? for a char but if the remaining string has an error then there is not a valid path to a terminating leaf; making the matching function redundant. I tried adding a 'step-over' ? char if the end of the node vector was reached and then followed every path of that node - allowing this step-over only once; resulted in a memory exception; I cannot find logically why it is accessing the vector out of range - bactracking?

Questions: 1) how can the regular expression step over an invalid char and continue with the path? 2) why is swapping the 'sticking' char for '?' resulting in an overflow?

Function:

void Ontology::matchRegExpHelper(nodeT *w, string inWild, Set<string> &matchSet, string out, int level, int pos, int stepover)
{   
    if (inWild=="") {
        matchSet.add(out);
    } else {
        if (w->alpha.size() == pos) {
            int testLength = out.length() + inWild.length(); 
            if (stepover == 0 && matchSet.size() == 0 && out.length() > 8 && testLength == tokenLength) {//candidate generator
                inWild[0] = '?';
                matchRegExpHelper(w, inWild,  matchSet, out, level, 0, stepover+1);
            } else 
                return; //giveup on this path
        }
        if (inWild[0] == '?' || (inWild[0] == '*' && (out.length() + inWild.length() ) == level ) ) { //wild
            matchRegExpHelper(w->alpha[pos].next, inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover);//follow path -> if ontology is full, treat '*' like a '?'
        } else if (inWild[0] == '*')
            matchRegExpHelper(w->alpha[pos].next, '*'+inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover); //keep adding chars
        if (inWild[0] == w->alpha[pos].letter) //follow self
            matchRegExpHelper(w->alpha[pos].next, inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover); //follow char 
        matchRegExpHelper(w, inWild, matchSet, out, level, pos+1, stepover);//check next path
    }
}

Error Message:

+str    "Attempt to access index 1 in a vector of size 1." std::basic_string<char,std::char_traits<char>,std::allocator<char> >
+err    {msg="Attempt to access index 1 in a vector of size 1." } ErrorException

Note: this function works fine for hundreds of test strings with '*' wilds if the extra stepover gate is not used

Semi-Solved: I place a pos < w->alpha.size() condition on each path that calls w->alpha[pos]... - this prevented the backtrack calls from attempting to access the vector with an out of bounds index value. Still have other issues to work out - it loops infinitely adding the ? and backtracking to remove it, then repeat. But, moving forward now.

Revised question: why during backtracking is the position index accumulating and/or not deincrementing - so at somepoint it calls w->alpha[pos]... with an invalid position that is either remaining from the next node or somehow incremented pos+1 when passing upward?

© Stack Overflow or respective owner

Related posts about recursion

Related posts about string-matching