tokenize - Page 2 - Developer IT

Antlr Lexer Quoted String Predicate

- by Loki

I'm trying to build a lexer to tokenize lone words and quoted strings. I got the following: STRING: QUOTE (options {greedy=false;} : . )* QUOTE ; WS : SPACE+ { $channel = HIDDEN; } ; WORD : ~(QUOTE|SPACE)+ ; For the corner cases, it needs to parse: "string" word1" word2 As three tokens: "string" as STRING and word1" and word2 as WORD. Basically, if there is a last quote, it needs to be part of the WORD were it is. If the quote is surrounded by white spaces, it should be a WORD. I tried this rule for WORD, without success: WORD: ~(QUOTE|SPACE)+ | (~(QUOTE|SPACE)* QUOTE ~QUOTE*)=> ~(QUOTE|SPACE)* QUOTE ~(QUOTE|SPACE)* ;

Read the article

Java parsing UTF8

- by Jack

I have the following issue with a UTF8 files structured as following: FIELD1§FIELD2§FIELD3§FIELD4 Looking at hexadecimal values of the file it uses A7 to codify §. So according to this codify it should be UTF8, but it's strange because A7 7F so 1 byte shouldn't be enough to codify §. So I tried using directly a BufferedReader with a specified charset: BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), utf8)) but when I try to tokenize the string with SmartTokenizer st = new SmartTokenizer(toTokenize, "§") (the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens) no splitting occurs, and if I try to print the string I obtain FIELD1?FIELD2?FIELD3?... so § used in the file is different from the one specified as a the delimiter, and it's not able to print out it too. So what's the problem here? Maybe the original file should use 2 bytes to store §?

Read the article

translate by replacing words inside existing text

- by Berry Tsakala

What are common approaches for translating certain words (or expressions) inside a given text, when the text must be reconstructed (with punctuations and everythin.) ? The translation comes from a lookup table, and covers words, collocations, and emoticons like L33t, CUL8R, :-), etc. Simple string search-and-replace is not enough since it can replace part of longer words (cat dog ? caterpillar dogerpillar). Assume the following input: s = "dogbert, started a dilbert dilbertion proces cat-bert :-)" after translation, i should receive something like: result = "anna, started a george dilbertion process cat-bert smiley" I can't simply tokenize, since i loose punctuations and word positions. Regular expressions, works for normal words, but don't catch special expressions like the smiley :-) but it does . re.sub(r'\bword\b','translation',s) ==> translation re.sub(r'\b:-\)\b','smiley',s) ==> :-) for now i'm using the above mentioned regex, and simple replace for the non-alphanumeric words, but it's far from being bulletproof. (p.s. i'm using python)

Read the article

How can I access a global pointer outside of a C function?

- by patrick

I am trying to access the data of*tkn within a different function in my program for example: putchar(*tkn); It is a global variable but its not working correctly. Any ideas? #define MAX 20 // globals char *tkn; char array[MAX]; ... void tokenize() { int i = 0, j = 0; char *delim = " "; tkn = strtok (str," "); // get token 1 if (tkn != NULL) { printf("token1: "); while ((*tkn != 0) && (tkn != NULL)) { putchar(*tkn); array[i] = *tkn; *tkn++; i++; } } }

Read the article

distinguishing a string with flex

- by haluk

Hi, I need to tokenize some strings which will be splitted of according to operators like = and !=. I was successful using regex until the string has != operator. In my case, string was seperated into two parts, which is expected but ! mark is in the left side even it is part of given operator. Therefore, I believe that regex is not suitable for it and I want to benefit from lex. Since I do not have enough knowledge and experience with lex, I am not sure whether it fits my work or not. Basically, I am trying to do replace the right hand side of the operators with actual values from other data. Do you people think that can it be helpful for my case? Thanks.

Read the article

Question marks in Java generics.

- by gnucom

I tried to make sure this wasn't a duplicate post, sorry if I was blind. This is a small snippet of code taken from some of the examples that accompany the Stanford Parser. I've been developing in Java for about 4 years, but have never had a very strong understanding of what this style of code is supposed to indicate. List<? extends HasWord> wordList = toke.tokenize(); I'm not worried about the details of the code. What I'm confusing about is what exactly the generic expression is supposed to convey, in English. Can someone explain this to me? Thanks!

Read the article

Python finding index in a array

- by NIH

I am trying to see if a company from a list of companies is in a line in a file. If it is I utilize the index of that company to increment a variable in another array. The following is my python code. I keep getting the following error: AttributeError: 'set' object has no attribute 'index'. I cannot figure out what is going wrong and think the error is the line that is surrounded by **. companies={'white house black market', 'macy','nordstrom','filene','walmart'} positives=[0 for x in xrange(len(companies))] negatives=[0 for x in xrange(len(companies))] for line in f: for company in companies: if company in line.lower(): words=tokenize.word_tokenize(line) bag=bag_of_words(words) classif=classifier.classify(bag) if classif=='pos': **indice =companies.index(company)** positives[indice]+=1 elif classif=='neg': **indice =companies.index(company)** negatives[indice]+=1

Read the article

ODI 11g - Cleaning control characters and User Functions

- by David Allan

In ODI user functions have a poor name really, they should be user expressions - a way of wrapping common expressions that you may wish to reuse many times - across many different technologies is an added bonus. To illustrate look at the problem of how to remove control characters from text. Users ask these types of questions over all technologies - Microsoft SQL Server, Oracle, DB2 and for many years - how do I clean a string, how do I tokenize a string and so on. After some searching around you will find a few ways of doing this, in Oracle there is a convenient way of using the TRANSLATE and REPLACE functions. So you can convert some text using the following SQL; replace( translate('This is my string'||chr(9)||' which has a control character', chr(3)||chr(4)||chr(5)||chr(9), chr(3) ), chr(3), '' ) If you had many columns to perform this kind of transformation on, in the Oracle database the natural solution you'd go to would be to code this as a PLSQL function since you don't want the code splattered everywhere. Someone tells you that there is another control character that needs added equals a maintenance headache. Coding it as a PLSQL function will incur a context switch between SQL and PLSQL which could prove costly. In ODI user functions let you capture this expression text and reference it many times across your mappings. This will protect the expression from being copy-pasted by developers and make maintenance much simpler - change the expression definition in one place. Firstly define a name and a syntax for the user function, I am calling it UF_STRIP_BAD_CHARACTERS and it has one parameter an input string; We then can define an implementation for each technology we will use it, I will define Oracle's using the inputString parameter and the TRANSLATE and REPLACE functions with whatever control characters I want to replace; I can then use this inside mapping expressions in ODI, below I am cleaning the ENAME column - a fabricated example but you get the gist. Note when I use the user function the function name remains in the text of the mapping, the actual expression is not substituted until I generate the scenario. If you generate the scenario and export the scenario you can have a peak at the code that is processed in the runtime - below you can see a snippet of my export scenario; That's all for now, hopefully a useful snippet of info.

Read the article

Tokenizing a string with unequal number of spaces between fields

- by gatechgrad

I am tryint to tokenize entries from a file. However I am not able to use the line.split("") option because of unequal number of spaces between files. I am copying a few lines from my file below: "08-09-2010 21:21:46 00:22:7f:a6:9b:69 -79" "08-09-2010 21:21:46 04:4f:aa:b4:49:49 -79" "08-09-2010 21:21:46 04:4f:aa:31:4e:59 tikona 18002090044 -83" "08-09-2010 21:21:46 00:22:7f:26:9b:69 tikona 18002090044 -74" "08-09-2010 21:21:46 04:4f:aa:34:0d:c9 tikona 18002090044 -82" "08-09-2010 21:21:46 04:4f:aa:71:4e:59 -85" "08-09-2010 21:21:46 04:4f:aa:34:21:89 tikona 18002090044 -75" "08-09-2010 21:21:46 04:4f:aa:34:49:49 tikona 18002090044 -77" "08-09-2010 21:21:46 04:4f:aa:74:0d:c9 -85" "08-09-2010 21:22:47 18 APs were seen " I need to access the first column (which is a datetime object) the second column (00:22...) and the last column (-79 etc.). I have no trouble accessing the first and second columns, but not the last column. When I do a info=line.spilt(""), since the third column might or might no entries, I am not able to determine the token number. How do i access the 4th column? Is there a way i can use info[i].contains(" -")?

Read the article

valgrind complains doing a very simple strtok in c

- by monkeyking

Hi I'm trying to tokenize a string by loading an entire file into a char[] using fread. For some strange reason it is not always working, and valgrind complains in this very small sample program. Given an input like test.txt first second And the following program #include <stdio.h> #include <string.h> #include <stdlib.h> #include <sys/stat.h> //returns the filesize in bytes size_t fsize(const char* fname){ struct stat st ; stat(fname,&st); return st.st_size; } int main(int argc, char *argv[]){ FILE *fp = NULL; if(NULL==(fp=fopen(argv[1],"r"))){ fprintf(stderr,"\t-> Error reading file:%s\n",argv[1]); return 0; } char buffer[fsize(argv[1])]; fread(buffer,sizeof(char),fsize(argv[1]),fp); char *str = strtok(buffer," \t\n"); while(NULL!=str){ fprintf(stderr,"token is:%s with strlen:%lu\n",str,strlen(str)); str = strtok(NULL," \t\n"); } return 0; } compiling like gcc test.c -std=c99 -ggdb running like ./a.out test.txt thanks

Read the article

Parsing/Tokenizing a String Containing a SQL Command

- by Alan Storm

Are there any open source libraries (any language, python/PHP preferred) that will tokenize/parse an ANSI SQL string into its various components? That is, if I had the following string SELECT a.foo, b.baz, a.bar FROM TABLE_A a LEFT JOIN TABLE_B b ON a.id = b.id WHERE baz = 'snafu'; I'd get back a data structure/object something like //fake PHPish $results['select-columns'] = Array[a.foo,b.baz,a.bar]; $results['tables'] = Array[TABLE_A,TABLE_B]; $results['table-aliases'] = Array[a=TABLE_A, b=TABLE_B]; //etc... Restated, I'm looking for the code in a database package that teases the SQL command apart so that the engine knows what to do with it. Searching the internet turns up a lot of results on how to parse a string WITH SQL. That's not what I want. I realize I could glop through an open source database's code to find what I want, but I was hoping for something a little more ready made, (although if you know where in the MySQL, PostgreSQL, SQLite source to look, feel free to pass it along) Thanks!

Read the article

More efficient approach to XSLT for-each

- by Paul

I have an XSLT which takes a . delimted string and splits it into two fields for a SQL statement: <xsl:for-each select="tokenize(Path,'\.')"> <xsl:choose> <xsl:when test="position() = 1 and position() = last()">SITE = '<xsl:value-of select="."/>' AND PATH = ''</xsl:when> <xsl:when test="position() = 1 and position() != last()">SITE = '<xsl:value-of select="."/>' </xsl:when> <xsl:when test="position() = 2 and position() = last()">AND PATH = '<xsl:value-of select="."/>' </xsl:when> <xsl:when test="position() = 2">AND PATH = '<xsl:value-of select="."/></xsl:when> <xsl:when test="position() > 2 and position() != last()">.<xsl:value-of select="."/></xsl:when> <xsl:when test="position() > 2 and position() = last()">.<xsl:value-of select="."/>' </xsl:when> <xsl:otherwise>zxyarglfaux</xsl:otherwise> </xsl:choose> </xsl:for-each> The results are as follows: INPUT: North OUTPUT: SITE = 'North' AND PATH = '' INPUT: North.A OUTPUT: SITE = 'North' AND PATH = 'A' INPUT: North.A.B OUTPUT: SITE = 'North' AND PATH = 'A.B' INPUT: North.A.B.C OUTPUT: SITE = 'North' AND PATH = 'A.B.C' This works, but is very lengthy. Can anyone see a more efficient approach? Thanks!

Read the article

Return a list of imported Python modules used in a script?

- by Jono Bacon

Hi All, I am writing a program that categorizes a list of Python files by which modules they import. As such I need to scan the collection of .py files ad return a list of which modules they import. As an example, if one of the files I import has the following lines: import os import sys, gtk I would like it to return: ["os", "sys", "gtk"] I played with modulefinder and wrote: from modulefinder import ModuleFinder finder = ModuleFinder() finder.run_script('testscript.py') print 'Loaded modules:' for name, mod in finder.modules.iteritems(): print '%s ' % name, but this returns more than just the modules used in the script. As an example in a script which merely has: import os print os.getenv('USERNAME') The modules returned from the ModuleFinder script return: tokenize heapq __future__ copy_reg sre_compile _collections cStringIO _sre functools random cPickle __builtin__ subprocess cmd gc __main__ operator array select _heapq _threading_local abc _bisect posixpath _random os2emxpath tempfile errno pprint binascii token sre_constants re _abcoll collections ntpath threading opcode _struct _warnings math shlex fcntl genericpath stat string warnings UserDict inspect repr struct sys pwd imp getopt readline copy bdb types strop _functools keyword thread StringIO bisect pickle signal traceback difflib marshal linecache itertools dummy_thread posix doctest unittest time sre_parse os pdb dis ...whereas I just want it to return 'os', as that was the module used in the script. Can anyone help me achieve this?

Read the article

Return a list of important Python modules in a script?

- by Jono Bacon

Hi All, I am writing a program that categorizes a list of Python files by which modules they import. As such I need to scan the collection of .py files ad return a list of which modules they import. As an example, if one of the files I import has the following lines: import os import sys, gtk I would like it to return: ["os", "sys", "gtk"] I played with modulefinder and wrote: from modulefinder import ModuleFinder finder = ModuleFinder() finder.run_script('testscript.py') print 'Loaded modules:' for name, mod in finder.modules.iteritems(): print '%s ' % name, but this returns more than just the modules used in the script. As an example in a script which merely has: import os print os.getenv('USERNAME') The modules returned from the ModuleFinder script return: tokenize heapq __future__ copy_reg sre_compile _collections cStringIO _sre functools random cPickle __builtin__ subprocess cmd gc __main__ operator array select _heapq _threading_local abc _bisect posixpath _random os2emxpath tempfile errno pprint binascii token sre_constants re _abcoll collections ntpath threading opcode _struct _warnings math shlex fcntl genericpath stat string warnings UserDict inspect repr struct sys pwd imp getopt readline copy bdb types strop _functools keyword thread StringIO bisect pickle signal traceback difflib marshal linecache itertools dummy_thread posix doctest unittest time sre_parse os pdb dis ...whereas I just want it to return 'os', as that was the module used in the script. Can anyone help me achieve this?

Read the article

How toget a list of "fastest miles" from a set of GPS Points

- by santiagobasulto

I'm trying to solve a weird problem. Maybe you guys know of some algorithm that takes care of this. I have data for a cargo freight truck and want to extract some data. Suppose I've got a list of sorted points that I get from the GPS. That's the route for that truck: [ { "lng": "-111.5373066", "lat": "40.7231711", "time": "1970-01-01T00:00:04Z", "elev": "1942.1789265256325" }, { "lng": "-111.5372056", "lat": "40.7228762", "time": "1970-01-01T00:00:07Z", "elev": "1942.109892409177" } ] Now, what I want to get is a list of the "fastest miles". I'll do an example: Given the points: A, B, C, D, E, F the distance from point A to point B is 1 mile, and the cargo took 10:32 minutes. From point B to point D i've got other mile, and the cargo took 10 minutes, etc. So, i need a list sorted by time. Similar to: B -> D: 10 A -> B: 10:32 D -> F: 11:02 Do you know any efficient algorithm that let me calculate that? Thank you all. PS: I'm using Python. EDIT: I've got the distance. I know how to calculate it and there are plenty of posts to do that. What I need is an algorithm to tokenize by mile and get speed from that. Having a distance function is not helpful enough: results = {} for point in points: aux_points = points.takeWhile(point>n) #This doesn't exist, just trying to be simple for aux_point in aux_points: d = distance(point, aux_point) if d == 1_MILE: time_elapsed = time(point, aux_point) results[time_elapsed] = (point, aux_point) I'm still doing some pretty inefficient calculations.

Read the article

XSLT 1.0 : Iterate over characters in a string

- by subtenante

I need to iterate over the characters in a string to build an XML structure. Currently, I am doing this : <xsl:template name="verticalize"> <xsl:param name="text">Some text</xsl:param> <xsl:for-each select="tokenize(replace(replace($text,'(.)','$1\\n'),'\\n$',''),'\\n')"> <xsl:element name="para"> <xsl:value-of select="."/> </xsl:element> </xsl:for-each> </xsl:template> This produces something like : <para>S</para> <para>o</para> <para>m</para> <para>e</para> <para> </para> <para>t</para> <para>e</para> <para>x</para> <para>t</para> This works fine with Xpath 2.0. But I need to apply the same treatment in a XPath 1.0 environment, where the replace() method is not available. Do you know a way to achieve this ?

Read the article

Adroid's DateFormat replacement - missing the format() with FieldPosition

- by user331244

Hi, I need to split a date string into pieces and I'm doing it using the public final StringBuffer format (Object object, StringBuffer buffer, FieldPosition field) from the java.text.DateFormat class. However, the implementation of this function is really slow, hence Android has an own implementation in android.text.format.DateFormat. BUT, in my case, I want to extract the different pieces of the date string (year, minute and so on). Since I need to be locale independent, I can not use SimpleDateFormat and custom strings. I do it as follows: Calendar c = ... // find out what field to extract int field = getField(); // Create a date string Field calendarField = DateFormat.Field.ofCalendarField(field); FieldPosition fieldPosition = new FieldPosition(calendarField); StringBuffer label = new StringBuffer(); label = getDateFormat().format(c.getTime(), label, fieldPosition); // Find the piece that we are looking for int beginIndex = fieldPosition.getBeginIndex(); int endIndex = fieldPosition.getEndIndex(); String asString = label.substring(beginIndex, endIndex); For some reason, the format() overload with the FieldPosition argument is not included in the android platform. Any ideas of how to do this in another way? Is there any easy way to tokenize the pattern string? Any other ideas?

Read the article

Boost binding a function taking a reference

- by Jamie Cook

Hi all, I am having problems compiling the following snippet int temp; vector<int> origins; vector<string> originTokens = OTUtils::tokenize(buffer, ","); // buffer is a char[] array // original loop BOOST_FOREACH(string s, originTokens) { from_string(temp, s); origins.push_back(temp); } // I'd like to use this to replace the above loop std::transform(originTokens.begin(), originTokens.end(), origins.begin(), boost::bind<int>(&FromString<int>, boost::ref(temp), _1)); where the function in question is // the third parameter should be one of std::hex, std::dec or std::oct template <class T> bool FromString(T& t, const std::string& s, std::ios_base& (*f)(std::ios_base&) = std::dec) { std::istringstream iss(s); return !(iss >> f >> t).fail(); } the error I get is 1>Compiling with Intel(R) C++ 11.0.074 [IA-32]... (Intel C++ Environment) 1>C:\projects\svn\bdk\Source\deps\boost_1_42_0\boost/bind/bind.hpp(303): internal error: assertion failed: copy_default_arg_expr: rout NULL, no error (shared/edgcpfe/il.c, line 13919) 1> 1> return unwrapper<F>::unwrap(f, 0)(a[base_type::a1_], a[base_type::a2_]); 1> ^ 1> 1>icl: error #10298: problem during post processing of parallel object compilation Google is being unusually unhelpful so I hope that some one here can provide some insights.

Read the article

Implemeting web-based autocomplete for structured input

- by ravigad

I am looking to implement web-based autocomplete for structured commands. So, for example, suppose I was trying to implement a web-based Windows command line with autocomplete, in such a case I would want to suggest all the available commands, then, once a user selects a command, all the options for that command, then if a user enters a switch (say '-' dash or '/' slash) then offer all the switches for that command and all the relevant values that can follow, and so on. All from the same text input box. If you have used version 4.0 of JIRA, I am thinking of something similar to the JQL search input box that they have implemented. I have not managed to find any tutorials that look at this scenario end-to-end (which is a shame, because it would be great to see more applications that do this). What I am looking for is some guidance on the steps needed to implement this solution: Do you have any experience implementing such a solution and if so what components did you use? Would you use a framework such as ANTLR to provide the available options to the end user? If not what would do? Sorry for raising such a general question, but my main problem is working out how the pieces fit together as opposed to, say, how to do autocomplete or how to parse/tokenize an input command...

Read the article

Good C string libary

- by chamakits

Hello all. I recently got inspired to start up a project I've been wanting to code for a while. I want to do it in C, because memory handling is key this application. I was searching around for a good implementation of strings in C, since I know me doing it myself could lead to some messy buffer overflows, and I expect to be dealing with a fairly big amount of strings. I found this article which gives details on each, but they each seem like they have a good amount of cons going for them (don't get me wrong, this article is EXTREMELY helpful, but it still worries me that even if I were to choose one of those, I wouldn't be using the best I can get). I also don't know how up to date the article is, hence my current plea. What I'm looking for is something that may hold a large amount of characters, and simplifies the process of searching through the string. If it allows me to tokenize the string in any way, even better. Also, it should have some pretty good I/O performance. Printing, and formatted printing isn't quite a top priority. I know I shouldn't expect a library to do all the work for me, but was just wandering if there was a well documented string function out there that could save me some time and some work. Any help is greatly appreciated. Thanks in advance! EDIT: I was asked about the license I prefer. Any sort of open source license will do, but preferably GPL (v2 or v3). EDIt2: I found betterString (bstring) library and it looks pretty good. Good documentation, small yet versatile amount of functions, and easy to mix with c strings. Anyone have any good or bad stories about it? The only downside I've read about it is that it lacks Unicode (again, read about this, haven't seen it face to face just yet), but everything else seems pretty good. EDIT3: Also, preferable that its pure C.

Read the article

JavaCC: Please help me understand token ambiguity.

- by java.is.for.desktop

Hello, everyone! I had already many problems with understanding, how ambiguous tokens can be handled elegantly (or somehow at all) in JavaCC. Let's take this example: I want to parse XML processing instruction. The format is: "<?" <target> <data> "?>": target is an XML name, data can be anything except ?>, because it's the closing tag. So, lets define this in JavaCC: (I use lexical states, in this case DEFAULT and PROC_INST) TOKEN : <#NAME : (very-long-definition-from-xml-1.1-goes-here) > TOKEN : <WSS : (" " | "\t")+ > // WSS = whitespaces <DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST} <PROC_INST> TOKEN : {<PI_TARGET : <NAME> >} <PROC_INST> TOKEN : {<PI_DATA : ~[] >} // accept everything <PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT} Now the part which recognizes processing instructions: void PROC_INSTR() : {} { ( <PI_START> (t=<PI_TARGET>){System.out.println("target: " + t.image);} <WSS> (t=<PI_DATA>){System.out.println("data: " + t.image);} <PI_END> ) {} } The problem is (i guess): hence <PI_DATA> recognizes "everything", my definition is wrong. Let's test it with <?mytarget here-goes-some-data?>: The target is recognized: "target: mytarget". But now I get my favorite JavaCC parsing error: !! procinstparser.ParseException: Encountered "" at line 1, column 15. !! Was expecting one of: !! Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC! I know, that I could use the MORE keyword of JavaCC, but this would give me the whole processing instruction as one token, so I'd had to parse/tokenize it further by myself. Why should I do that? Am I writing a parser that does not parse? What I would need is telling JavaCC to recognize "everything until ?>" as processing instruction data. How can it be done? Thank you.

Read the article

Performance of tokenizing CSS in PHP

- by Boldewyn

This is a noob question from someone who hasn't written a parser/lexer ever before. I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft). It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings. My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really good. However, for a 1000 lines CSS string this takes something between 2 and 8 seconds, which is too much for my project. Now I'm banging my head how other parsers tokenize and parse CSS in fractions of seconds. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh! s that I fell into? I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this hadn't brought any great performance boosts. My code (snippet) so far: $TOKENS = array( 'IDENT' => '...regexp...', 'ATKEYWORD' => '@...regexp...', 'String' => '"...regexp..."|\'...regexp...\'', //... ); $string = '...CSS source string...'; $stream = array(); // we reduce $string token by token while ($string != '') { $string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the // start is insignificant but doing a trim reduces exec time by 25% $matches = array(); // loop through all possible tokens foreach ($TOKENS as $t => $p) { // The '&' is used as delimiter, because it isn't used anywhere in // the token regexps if (preg_match('&^'.$p.'&Su', $string, $matches)) { $stream[] = array($t, $matches[0]); $string = substr($string, strlen($matches[0])); // Yay! We found one that matches! continue 2; } } // if we come here, we have a syntax error and handle it somehow } // result: an array $stream consisting of arrays with // 0 => type of token // 1 => token content

Read the article

JavaCC: How can one exclude a string from a token? (A.k.a. understanding token ambiguity.)

- by java.is.for.desktop

Hello, everyone! I had already many problems with understanding, how ambiguous tokens can be handled elegantly (or somehow at all) in JavaCC. Let's take this example: I want to parse XML processing instruction. The format is: "<?" <target> <data> "?>": target is an XML name, data can be anything except ?>, because it's the closing tag. So, lets define this in JavaCC: (I use lexical states, in this case DEFAULT and PROC_INST) TOKEN : <#NAME : (very-long-definition-from-xml-1.1-goes-here) > TOKEN : <WSS : (" " | "\t")+ > // WSS = whitespaces <DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST} <PROC_INST> TOKEN : {<PI_TARGET : <NAME> >} <PROC_INST> TOKEN : {<PI_DATA : ~[] >} // accept everything <PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT} Now the part which recognizes processing instructions: void PROC_INSTR() : {} { ( <PI_START> (t=<PI_TARGET>){System.out.println("target: " + t.image);} <WSS> (t=<PI_DATA>){System.out.println("data: " + t.image);} <PI_END> ) {} } Let's test it with <?mytarget here-goes-some-data?>: The target is recognized: "target: mytarget". But now I get my favorite JavaCC parsing error: !! procinstparser.ParseException: Encountered "" at line 1, column 15. !! Was expecting one of: !! Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC! I know, that I could use the MORE keyword of JavaCC, but this would give me the whole processing instruction as one token, so I'd had to parse/tokenize it further by myself. Why should I do that? Am I writing a parser that does not parse? The problem is (i guess): hence <PI_DATA> recognizes "everything", my definition is wrong. I should tell JavaCC to recognize "everything except ?>" as processing instruction data. But how can it be done? NOTE: I can only exclude single characters using ~["a"|"b"|"c"], I can't exclude strings such as ~["abc"] or ~["?>"]. Another great anti-feature of JavaCC. Thank you.

Read the article

overriding ctype<wchar_t>

- by Potatoswatter

I'm writing a lambda calculus interpreter for fun and practice. I got iostreams to properly tokenize identifiers by adding a ctype facet which defines punctuation as whitespace: struct token_ctype : ctype<char> { mask t[ table_size ]; token_ctype() : ctype<char>( t ) { for ( size_t tx = 0; tx < table_size; ++ tx ) { t[tx] = isalnum( tx )? alnum : space; } } }; (classic_table() would probably be cleaner but that doesn't work on OS X!) And then swap the facet in when I hit an identifier: locale token_loc( in.getloc(), new token_ctype ); … locale const &oldloc = in.imbue( token_loc ); in.unget() >> token; in.imbue( oldloc ); There seems to be surprisingly little lambda calculus code on the Web. Most of what I've found so far is full of unicode ? characters. So I thought to try adding Unicode support. But ctype<wchar_t> works completely differently from ctype<char>. There is no master table; there are four methods do_is x2, do_scan_is, and do_scan_not. So I did this: struct token_ctype : ctype< wchar_t > { typedef ctype<wchar_t> base; bool do_is( mask m, char_type c ) const { return base::do_is(m,c) || (m&space) && ( base::do_is(punct,c) || c == L'?' ); } const char_type* do_is (const char_type* lo, const char_type* hi, mask* vec) const { base::do_is(lo,hi,vec); for ( mask *vp = vec; lo != hi; ++ vp, ++ lo ) { if ( *vp & punct || *lo == L'?' ) *vp |= space; } return hi; } const char_type *do_scan_is (mask m, const char_type* lo, const char_type* hi) const { if ( m & space ) m |= punct; hi = do_scan_is(m,lo,hi); if ( m & space ) hi = find( lo, hi, L'?' ); return hi; } const char_type *do_scan_not (mask m, const char_type* lo, const char_type* hi) const { if ( m & space ) { m |= punct; while ( * ( lo = base::do_scan_not(m,lo,hi) ) == L'?' && lo != hi ) ++ lo; return lo; } return base::do_scan_not(m,lo,hi); } }; (Apologies for the flat formatting; the preview converted the tabs differently.) The code is WAY less elegant. I does better express the notion that only punctuation is additional whitespace, but that would've been fine in the original had I had classic_table. Is there a simpler way to do this? Do I really need all those overloads? (Testing showed do_scan_not is extraneous here, but I'm thinking more broadly.) Am I abusing facets in the first place? Is the above even correct? Would it be better style to implement less logic?

Read the article

XSLT: look for a word and it's context

- by farhad

Hello! I need to search a word and it's context into an xml. For example <line>hello world, my name is farhad and i'm having trouble with xslt</line> looking for 'and', context of 3 words: <line>hello world, my name is farhad and i'm having trouble with xslt</line> How can i do? I wrote some xslt to find the word, but i can't go back 3 words to set span. This is my xslt: <xsl:variable name="delimiters">[,.;!?\s"()]+</xsl:variable> <xsl:template match="/"> <xsl:apply-templates select="//line"/> </xsl:template> <xsl:template match="line"> <line> <xsl:for-each select="tokenize(.,'\s')"> <xsl:choose>  <xsl:when test="compare(replace(.,$delimiters,'$1'),'red') = 0"> <xsl:value-of select="."/> </xsl:when> <xsl:otherwise> <xsl:value-of select="."/> <xsl:choose> <xsl:when test="position()=last()"> <xsl:text></xsl:text> </xsl:when> <xsl:otherwise> <xsl:text> </xsl:text> </xsl:otherwise> </xsl:choose> </xsl:otherwise> </xsl:choose> </xsl:for-each> </line><xsl:text> </xsl:text> </xsl:template> This is an example xml: http://pastebin.com/eAVM9CDQ . I have to search for context also on preceding tags, for example: <line>hello world,</line> <line>my name</line> <line>is farhad </line> <line>and i'm having</line> <line>trouble with xslt</line> so, looking for 'and', context of 3 words: <line>hello world,</line> <line>my name</line> <line>is farhad </line> <line>and i'm having</line> <line>trouble with xslt</line> with problems of overlapping, but now it's not a problem (i think i know how manage it). How can i search a word and its context? Thank you very much.

Search Results

Search found 56 results on 3 pages for 'tokenize'.

Page 2/3 | < Previous Page | 1 2 3 | Next Page >

- by Loki

- by Jack

- by Berry Tsakala

- by patrick

- by haluk

- by gnucom

- by NIH

- by David Allan

- by gatechgrad

- by monkeyking

- by Alan Storm

- by Paul

- by Jono Bacon

- by Jono Bacon

- by santiagobasulto

- by subtenante

- by user331244

- by Jamie Cook

- by ravigad

- by chamakits

- by java.is.for.desktop

- by Boldewyn

- by java.is.for.desktop

- by Potatoswatter

- by farhad

< Previous Page | 1 2 3 | Next Page >