trie reg exp parse step over char and continue
- by forest.peterson
Setup: 1) a string trie database formed from linked nodes and a vector array linking to the next node terminating in a leaf, 2) a recursive regular expression function that if A) char '*' continues down all paths until string length limit is reached, then continues down remaining string paths if valid, and B) char '?' continues down all paths for 1 char and then continues down remaining string paths if valid. 3) after reg expression the candidate strings are measured for edit distance against the 'try' string.
Problem: the reg expression works fine for adding chars or swapping ? for a char but if the remaining string has an error then there is not a valid path to a terminating leaf; making the matching function redundant. I tried adding a 'step-over' ? char if the end of the node vector was reached and then followed every path of that node - allowing this step-over only once; resulted in a memory exception; I cannot find logically why it is accessing the vector out of range - bactracking?
Questions: 1) how can the regular expression step over an invalid char and continue with the path? 2) why is swapping the 'sticking' char for '?' resulting in an overflow?
Function:
void Ontology::matchRegExpHelper(nodeT *w, string inWild, Set<string> &matchSet, string out, int level, int pos, int stepover)
{
if (inWild=="") {
matchSet.add(out);
} else {
if (w->alpha.size() == pos) {
int testLength = out.length() + inWild.length();
if (stepover == 0 && matchSet.size() == 0 && out.length() > 8 && testLength == tokenLength) {//candidate generator
inWild[0] = '?';
matchRegExpHelper(w, inWild, matchSet, out, level, 0, stepover+1);
} else
return; //giveup on this path
}
if (inWild[0] == '?' || (inWild[0] == '*' && (out.length() + inWild.length() ) == level ) ) { //wild
matchRegExpHelper(w->alpha[pos].next, inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover);//follow path -> if ontology is full, treat '*' like a '?'
} else if (inWild[0] == '*')
matchRegExpHelper(w->alpha[pos].next, '*'+inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover); //keep adding chars
if (inWild[0] == w->alpha[pos].letter) //follow self
matchRegExpHelper(w->alpha[pos].next, inWild.substr(1), matchSet, out+w->alpha[pos].letter, level, 0, stepover); //follow char
matchRegExpHelper(w, inWild, matchSet, out, level, pos+1, stepover);//check next path
}
}
Error Message:
+str "Attempt to access index 1 in a vector of size 1." std::basic_string<char,std::char_traits<char>,std::allocator<char> >
+err {msg="Attempt to access index 1 in a vector of size 1." } ErrorException
Note: this function works fine for hundreds of test strings with '*' wilds if the extra stepover gate is not used
Semi-Solved: I place a pos < w->alpha.size() condition on each path that calls w->alpha[pos]... - this prevented the backtrack calls from attempting to access the vector with an out of bounds index value. Still have other issues to work out - it loops infinitely adding the ? and backtracking to remove it, then repeat. But, moving forward now.
Revised question: why during backtracking is the position index accumulating and/or not deincrementing - so at somepoint it calls w->alpha[pos]... with an invalid position that is either remaining from the next node or somehow incremented pos+1 when passing upward?