Search Results

Search found 56 results on 3 pages for 'tokenize'.

Page 1/3 | 1 2 3  | Next Page >

  • Lucene.Net: what is the best approach to tokenize comma-delimited data (atomically)? Does Lucene.Net support phrases?

    - by Pete Alvin
    I have a database column I wish to index that holds comma-delimited names, e.g., User.FullNameList = "Helen Ready, Phil Collins, Brad Paisley". I'd prefer to tokenize each name atomically (each full name as a single searchable entity). What is the best approach for this? Did I miss a simple option for setting the tokenizer delimiter? Do I have to subclass or write my own tokenizer class? Something else? ;) Or does Lucene.Net not support phrases? Or is it smart enough to handle this use case automatically? I'm sure I'm not the first person to need this, but Googling produced no noticeable solutions.
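
    One workable approach, sketched below against the Lucene.Net 2.9-era Field API (the field name and helper class are illustrative, not from the question): split the column on commas yourself and add each trimmed name as a separate non-analyzed field value, so every full name is indexed as one indivisible term.

      using Lucene.Net.Documents;

      class NameIndexer
      {
          // Hypothetical helper: add each comma-delimited name as a single
          // un-analyzed term, so "Phil Collins" stays one token in the index.
          public static Document BuildDoc(string fullNameList)
          {
              var doc = new Document();
              foreach (string name in fullNameList.Split(','))
              {
                  doc.Add(new Field("fullName", name.Trim(),
                                    Field.Store.YES, Field.Index.NOT_ANALYZED));
              }
              return doc;
          }
      }

    Searching that field with a TermQuery for the exact name then matches the whole phrase without any custom tokenizer.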

    Read the article

  • Tokenize a command string

    - by pocoa
    I have a string like this:

      command "http://www.mysite.com" some_param="string param" some_param2=50

    I want to tokenize it into:

      command
      "http://www.mysite.com"
      some_param="string param"
      some_param2=50

    I know it's possible to split on spaces, but these parameters can also be separated by commas, like:

      command "http://www.mysite.com", some_param="string param", some_param2=50

    I tried to do it like this: \w+\=?\"?.+\"? but it didn't work.
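
    A hedged sketch in Python (the asker's language isn't stated): one alternation that tries key="quoted value" first, then key=value, then a bare quoted string, then a bare word, so commas fall through as separators.

      import re

      # Alternatives are tried left to right at each position:
      # key="quoted value" | key=value | "quoted string" | bare word
      TOKEN = re.compile(r'\w+="[^"]*"|\w+=[^\s,]+|"[^"]*"|\w+')

      line = 'command "http://www.mysite.com", some_param="string param", some_param2=50'
      tokens = [m.group(0) for m in TOKEN.finditer(line)]
      # ['command', '"http://www.mysite.com"', 'some_param="string param"', 'some_param2=50']

    The original attempt fails mainly because .+ is greedy and \"? makes the quotes optional everywhere, so one match can swallow several parameters.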

    Read the article

  • Tokenize problem in Java with separator ". "

    - by user112976
    I need to split a text using the separator ". ". For example, I want this string:

      Washington is the U.S Capital. Barack is living there.

    to be cut into two parts:

      Washington is the U.S Capital.
      Barack is living there.

    Here is my code:

      // Initialize the tokenizer
      StringTokenizer tokenizer = new StringTokenizer(
          "Washington is the U.S Capital. Barack is living there.", ". ");
      while (tokenizer.hasMoreTokens()) {
          System.out.println(tokenizer.nextToken());
      }

    And the output is unfortunately:

      Washington is the U
      S Capital
      Barack is living there

    Can someone explain what's going on?
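
    For context: StringTokenizer treats its second argument as a set of single-character delimiters, not as one two-character separator, so both '.' and ' ' split the text, which is exactly the output above. A minimal sketch of the usual fix, splitting on the literal two-character sequence with String.split:

      public class SplitDemo {
          public static void main(String[] args) {
              String text = "Washington is the U.S Capital. Barack is living there.";
              // "\\. " is a regex matching a literal '.' followed by a space,
              // so "U.S" (dot not followed by a space) stays intact.
              for (String part : text.split("\\. ")) {
                  System.out.println(part);
              }
          }
      }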

    Read the article

  • How do I get the next token in a C-string if I want to use it as an int? (C++)

    - by Van
    My objective is to take directions from a user (and eventually from a text file) to move a robot. The catch is that I must use C-strings (such as char word[];) rather than std::string, and tokenize them for use. The code looks like this:

      void Navigator::manualDrive() {
          char uinput[1];
          char delim[] = " ";
          char *token;
          cout << "Enter your directions below: \n";
          cin.ignore();
          cin.getline(uinput, 256);
          token = strtok(uinput, delim);
          if (token == "forward") {
              int inches;
              inches = token + 1;
              travel(inches);
          }
      }

    I've never used C-strings and I've never tokenized anything before, and I don't know how to write this. Our T.A.s expect us to Google and find all the answers because they know we've never been taught these methods, and everyone in my lab is having much more trouble than usual. I don't know the code to write, but I know what I want my program to do:

    1) Ask for directions.
    2) cin.getline the user's input.
    3) Tokenize the inputted string.
    4) If the first token == "forward", move to the next token, find out how many inches to move forward, then move forward.
    5) Else if the first token == "turn", move to the next token; if that token == "left", move to the next token and find out how many degrees to turn left.

    I will have to do this for forward x, backward x, turn left x, turn right x, and stop (where x is in inches or degrees). I already wrote functions that tell the robot how to move forward an inch and turn in degrees. I just need to know how to convert the inputted strings to all lowercase letters, move from token to token, and extract the numbers from the string to use as integers. A sketch of the fixed loop follows below. If anything is unclear, the full lab write-up is at http://www.cs.utk.edu/~cs102/robot_labs/Lab9.html, and I will clarify as best I can.
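
    A hedged sketch of that loop with the usual fixes (travel is a stub standing in for the asker's existing robot function): give getline a real buffer, compare tokens with strcmp rather than ==, advance with strtok(NULL, ...), and convert with atoi.

      #include <cstring>
      #include <cstdlib>
      #include <iostream>
      using namespace std;

      // Stub standing in for the robot-moving function the asker already wrote.
      void travel(int inches) { cout << "moving " << inches << " inches\n"; }

      void manualDrive() {
          char uinput[256];                 // was uinput[1]: far too small
          char delim[] = " ";
          cout << "Enter your directions below:\n";
          cin.getline(uinput, 256);

          char *token = strtok(uinput, delim);
          // == would compare pointers, not contents; strcmp compares contents.
          if (token != NULL && strcmp(token, "forward") == 0) {
              token = strtok(NULL, delim);  // NULL means "continue in same string"
              if (token != NULL) {
                  int inches = atoi(token); // C-string digits -> int
                  travel(inches);
              }
          }
      }

      int main() { manualDrive(); }

    For lowercasing, tolower from <cctype> applied to each character of the token is the standard C-string route.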

    Read the article

  • SWI-Prolog tokenize_atom/2 replacement?

    - by Shark
    What I need to do is break an atom into tokens. E.g., tokenize_string('Hello, World!', L). would unify L = ['Hello', ',', 'World', '!'], exactly as tokenize_atom/2 does. But when I try to use tokenize_atom/2 with non-Latin letters it fails. Is there any universal replacement, or how can I write one? Thanks in advance.
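
    A minimal sketch of a hand-rolled replacement, assuming SWI-Prolog's char_type/2 classifies the non-Latin letters in question as alpha (tokenize_string/2 and the helpers are made-up names):

      % Break an atom into word tokens and one-char punctuation tokens.
      tokenize_string(Atom, Tokens) :-
          atom_chars(Atom, Chars),
          chars_tokens(Chars, Tokens).

      chars_tokens([], []).
      chars_tokens([C|Cs], Tokens) :-
          (   char_type(C, space)                 % skip whitespace
          ->  chars_tokens(Cs, Tokens)
          ;   char_type(C, alpha)                 % letters: take the whole run
          ->  take_alpha([C|Cs], Word, Rest),
              atom_chars(Token, Word),
              Tokens = [Token|Ts],
              chars_tokens(Rest, Ts)
          ;   Tokens = [C|Ts],                    % punctuation: one-char token
              chars_tokens(Cs, Ts)
          ).

      take_alpha([C|Cs], [C|Ws], Rest) :-
          char_type(C, alpha), !,
          take_alpha(Cs, Ws, Rest).
      take_alpha(Cs, [], Cs).

    Whether this handles a given script comes down to whether char_type/2 does, so it is worth testing on the actual text first.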

    Read the article

  • Tokenizing Twitter Posts in Lucene

    - by Amaç Herdagdelen
    Hello, my question in a nutshell: does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

    More detailed version: I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff, like keeping domain names and email addresses and recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?

    My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing the characters with other alphanumeric strings. For example:

      String newText = text.replaceAll("#", "hashtag");
      newText = newText.replaceAll("@", "addresstag");

    Unfortunately this method breaks legitimate email addresses, but I can live with that. Does that approach make sense? Thanks in advance! Amaç
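
    One refinement of that preprocessing, offered as a hedged suggestion rather than the asker's method: rewrite '#' and '@' only when they start a whitespace-delimited token, which leaves the '@' inside an email address alone.

      public class TweetPreprocess {
          // Replace @/# only at the start of a token, via a start-or-space group.
          public static String protectTags(String text) {
              return text
                  .replaceAll("(^|\\s)@(\\w+)", "$1addresstag$2")
                  .replaceAll("(^|\\s)#(\\w+)", "$1hashtag$2");
          }

          public static void main(String[] args) {
              System.out.println(protectTags("@user mailed me@example.com #news"));
              // addresstaguser mailed me@example.com hashtagnews
          }
      }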

    Read the article

  • Difference between ruby StringScanner post_match and rest?

    - by zhon
    What is the difference between ruby's StringScanner#post_match and StringScanner#rest?

      scanner = StringScanner.new('Say hello to...')
      scanner.scan(/\w+/)        #=> 'Say'
      scanner.scan(/\s+/)        #=> ' '
      scanner.rest               #=> 'hello to...'
      scanner.post_match         #=> 'hello to...'
      scanner.rest.class         #=> String
      scanner.post_match.class   #=> String

    I don't see a difference. It seems like both return a string containing everything after the match.
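
    A hedged sketch of one way to see them diverge: rest is computed from the scanner's current position, while post_match belongs to the most recent match attempt and (as I understand the API) returns nil when that attempt failed.

      require 'strscan'

      s = StringScanner.new('Say hello')
      s.scan(/\w+/)      #=> "Say"
      s.scan(/\d+/)      #=> nil (no digits at the current position)
      s.post_match       #=> nil -- tied to the last (failed) match
      s.rest             #=> " hello" -- always: everything after the pointer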

    Read the article

  • Macro and array crossing

    - by Thomas
    I am having a problem with a Lisp macro. I would like to create a macro which generates a switch-case according to an array. Here is the code to generate the switch-case:

      (defun split-elem (val)
        `(,(car val) ',(cdr val)))

      (defmacro generate-switch-case (var opts)
        `(case ,var
           ,(mapcar #'split-elem opts)))

    I can use it with code like this:

      (generate-switch-case onevar ((a . A) (b . B)))

    But when I try to do something like this:

      (defparameter *operators* '((+ . OPERATOR-PLUS)
                                  (- . OPERATOR-MINUS)
                                  (/ . OPERATOR-DIVIDE)
                                  (= . OPERATOR-EQUAL)
                                  (* . OPERATOR-MULT)))

      (defmacro tokenize (data ops)
        (let ((sym (string->list data)))
          (mapcan (lambda (x) (generate-switch-case x ops)) sym)))

      (tokenize data *operators*)

    I get this error:

      *** - MAPCAR: A proper list must not end with OPS

    but I don't understand why. When I print the type of ops I get SYMBOL; I was expecting CONS. Is it related? Also, for my function tokenize, how many times is the lambda evaluated (or the macro expanded)?
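
    The SYMBOL observation is indeed the heart of it: a macro receives its arguments as unevaluated source forms, so inside tokenize the parameter ops is bound to the symbol *OPERATORS*, not to the alist that symbol will evaluate to at run time, and generate-switch-case then tries to mapcar over that symbol. A tiny demonstration of the mechanics, under that reading:

      ;; A macro sees the source form itself, not its run-time value.
      (defmacro show-arg (x)
        (format t "macro saw: ~s (a ~s)~%" x (type-of x))
        nil)

      (defparameter *operators* '((+ . op-plus)))
      (show-arg *operators*)   ; prints: macro saw: *OPERATORS* (a SYMBOL)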

    Read the article

  • Declaring an array of character pointers (arg passing)

    - by Isaac Copper
    This is something that should be easy to answer, but it is more difficult for me to find a particular right answer on Google or in K&R. I could totally be overlooking something, and if so please set me straight! The pertinent code is below:

      int main() {
          char tokens[100][100];
          char str[] = "This is my string";
          tokenize(str, tokens);
          for (int i = 0; i < 100; i++) {
              printf("%s is a token\n", tokens[i]);
          }
      }

      void tokenize(char *str, char tokens[][]) {
          // do stuff with string and tokens, putting
          // chars into the token array like so:
          tokens[i][j] = <A CHAR>
      }

    So I realize that I can't have char tokens[][] in my tokenize function, but if I put in char **tokens instead, I get a compiler warning. Also, when I try to put a char into my char array with tokens[i][j] = <A CHAR>, I segfault. Where am I going wrong? (And in how many ways... and how can I fix it?) Thanks so much!
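
    A hedged sketch of the usual fix: an array parameter must declare every dimension except the first, so char tokens[][100] works where char tokens[][] cannot; char ** is not a substitute, because a pointer-to-pointer has a different memory layout than a contiguous 2-D array.

      #include <cstdio>
      #include <cstring>

      // All dimensions but the first must be known so the compiler can
      // compute the address of tokens[i][j].
      void tokenize(char *str, char tokens[][100]) {
          int i = 0;
          char *tok = strtok(str, " ");
          while (tok != NULL && i < 99) {
              strcpy(tokens[i++], tok);
              tok = strtok(NULL, " ");
          }
          tokens[i][0] = '\0';               // empty string marks the end
      }

      int main() {
          char tokens[100][100];
          char str[] = "This is my string";  // writable array, not a literal pointer
          tokenize(str, tokens);
          for (int i = 0; tokens[i][0] != '\0'; i++)
              printf("%s is a token\n", tokens[i]);
      }

    Declaring tokenize (or a prototype for it) before main also matters; the original calls it before any declaration is in scope.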

    Read the article

  • Cure for puzzle piece programming habits?

    - by Recursion
    Even though I went to a decent CS school, I was still taught with the mentality of programming with puzzle pieces. By puzzle pieces I mean looking up code segments at each step of the development process and adding them together as needed, eventually gathering all of the pieces into a properly working program. So, as an example, if the next step in my program is to tokenize a string, I go to Google and search "how do I tokenize a string in <language>", all instead of thinking critically about the implementation. I personally don't think it's a very good way to program, and I always seem to forget everything that I have searched for. So how can I get out of this puzzle-piece mode of programming that I was taught?

    Read the article

  • Tokenizing a string with variable whitespace

    - by Ron Holcomb
    I've read through a few threads detailing how to tokenize strings, but I'm apparently too thick to adapt their suggestions and solutions into my program. What I'm attempting to do is tokenize each line from a large (5k+ line) file into two strings. Here's a sample of the lines:

      0              -0.11639404
      9.0702948e-05   0.00012207031
      0.0001814059    0.051849365
      0.00027210884   0.062103271
      0.00036281179   0.034423828
      0.00045351474   0.035125732

    The difference between my lines and the sample input in other threads is that I have a variable amount of whitespace between the parts that I want to tokenize. Anyway, here's my attempt at tokenizing:

      #include <iostream>
      #include <iomanip>
      #include <fstream>
      #include <string>
      using namespace std;

      int main(int argc, char *argv[])
      {
          ifstream input;
          ofstream output;
          string temp2;
          string temp3;
          input.open(argv[1]);
          output.open(argv[2]);
          if (input.is_open()) {
              while (!input.eof()) {
                  getline(input, temp2, ' ');
                  while (!isspace(temp2[0]))
                      getline(input, temp2, ' ');
                  getline(input, temp3, '\n');
              }
              input.close();
              cout << temp2 << endl;
              cout << temp3 << endl;
              return 0;
          }

    I've clipped it some, since the troublesome bits are here. The issue is that temp2 never seems to catch a value. Ideally, it should get populated with the first column of numbers, but it doesn't; instead, it is blank, and temp3 is populated with the entire line. Unfortunately, in my course we haven't learned about vectors, so I'm not quite sure how to implement them per the other solutions I've seen, and I'd rather not just copy-paste code for assignments without actually understanding it. So, what's the extremely obvious/already-answered/simple solution I'm missing? I'd like to stick to the standard libraries that g++ uses if at all possible.
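
    A hedged sketch of the usual idiom: stream extraction with >> already skips any run of whitespace, so reading two fields per line needs no manual splitting if each line is read whole first and then re-parsed.

      #include <fstream>
      #include <iostream>
      #include <sstream>
      #include <string>

      int main(int argc, char *argv[]) {
          if (argc < 2) return 1;
          std::ifstream input(argv[1]);
          std::string line;
          while (std::getline(input, line)) {   // read a full line first
              std::istringstream fields(line);
              std::string first, second;
              if (fields >> first >> second)    // >> skips variable whitespace
                  std::cout << first << " | " << second << '\n';
          }
      }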

    Read the article

  • Ruby : UTF-8 IO

    - by subtenante
    I use Ruby 1.8.7. I am trying to parse some text files containing Greek sentences, encoded in UTF-8. (I can't paste sample files here, because they are subject to copyright; it is really just some Greek text encoded in UTF-8.) I want, for each file, to parse it, extract all the words, and make a list of each new word found in that file, all saved to one big index file. Here is my code:

      #!/usr/bin/ruby -KU

      def prepare_line(l)
        l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
      end

      def tokenize(l)
        l.split /['·.;!:\s]+/u
      end

      $dict = {}
      $cpt = 0
      $out = File.new 'out.txt', 'w'

      def lesson(file)
        $cpt = $cpt + 1
        file.readlines.each do |l|
          $out.puts l
          l = prepare_line l
          tokenize(l).each do |t|
            unless $dict[t]
              $dict[t] = $cpt
              $out.puts " #{t}\n"
            end
          end
        end
      end

      Dir.new('etc/').each do |filename|
        f = File.new("etc/#{filename}")
        unless File.directory? f
          lesson f
        end
      end

    Here is part of my output (the extracted words come out mangled):

      ?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa

    (Note that the puts l part seems to work fine, at the end of the given output line.) Any idea what is wrong with my code? General comments about Ruby idioms I could use are very welcome; I'm really a beginner.

    Read the article

  • memcmp, strcmp, strncmp in C

    - by el10780
    I wrote this small piece of code in C to test the memcmp(), strncmp(), and strcmp() functions:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(int argc, char **argv)
      {
          char *word1 = "apple", *word2 = "atoms";
          if (strncmp(word1, word2, 5) == 0)
              printf("strncmp result.\n");
          if (memcmp(word1, word2, 5) == 0)
              printf("memcmp result.\n");
          if (strcmp(word1, word2) == 0)
              printf("strcmp result.\n");
      }

    Can somebody explain the differences to me, because I am confused about these three functions? My main problem is this: I have a file whose lines I tokenize, and when I tokenize the word "atoms" I have to stop the tokenizing process. I first tried strcmp(), but unfortunately when it reached the point in the file where the word "atoms" was placed it didn't stop, and it continued. When I used either memcmp() or strncmp() it stopped, and I was happy. But then I thought: what if there is a string whose first five letters are a, t, o, m, s, followed by other letters? Unfortunately, my worry was right, as I tested with the above code: initializing word1 to "atomsaaaaa" and word2 to "atoms", memcmp() and strncmp() in the if statements returned 0, but strcmp() did not. It seems that I must use strcmp(). I have done Google searches, but I got more confused because sites and forums define these three differently. If someone could give me correct explanations/definitions so I can use them correctly in my source code, I would be really grateful.
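
    For what it's worth, the standard definitions do explain this behaviour: strcmp compares up to and including the terminating NUL, strncmp compares at most n characters but still stops at a NUL, and memcmp compares exactly n raw bytes with no notion of string ends. A short demonstration:

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          const char *a = "atomsaaaaa", *b = "atoms";
          /* Only the first 5 bytes are compared: both are "atoms", so 0. */
          printf("strncmp: %d\n", strncmp(a, b, 5));
          printf("memcmp:  %d\n", memcmp(a, b, 5));
          /* strcmp runs past byte 5: the extra 'a' in "atomsaaaaa" is
             compared against b's NUL terminator, so the result is nonzero. */
          printf("strcmp:  %d\n", strcmp(a, b));
          return 0;
      }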

    Read the article

  • Good library for search text tokenization

    - by Chris Dutrow
    Looking to tokenize some text in the same or a similar way in which a search engine would do it. The reason we are doing this is so that we can run some statistical analysis on the tokens. The language we are using is Python, so we would prefer a library in that language, but could probably set something up to use another language if necessary. Example:

      Original token:       "We have some great burritos!"
      More simplified:      "We have some great burrito"   (remove plurals and punctuation)
      Even more simplified: "great burrito"                (remove superfluous words)
      Best:                 "burrito -positive-"           (recognize positive and negative meaning)
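
    A hedged sketch of the first three stages with NLTK, one common Python choice for this (the sentiment step would need something extra, such as a polarity lexicon, and is not shown):

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import PorterStemmer

      # One-time setup: nltk.download('punkt'); nltk.download('stopwords')
      text = "We have some great burritos!"

      tokens = nltk.word_tokenize(text)                    # split into tokens
      words = [t.lower() for t in tokens if t.isalpha()]   # drop punctuation
      stemmer = PorterStemmer()
      stems = [stemmer.stem(w) for w in words]             # burritos -> burrito
      content = [w for w in stems if w not in stopwords.words('english')]
      print(content)   # roughly: ['great', 'burrito']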

    Read the article

  • Simple C# Tokenizer Using Regex

    - by Pete
    I'm looking to tokenize really simple strings, but I'm struggling to get the right regex. The strings might look like this:

      string1 = "{[Surname]}, some text... {[FirstName]}"
      string2 = "{Item}foo.{Item2}bar"

    And I want to extract the tokens in the curly braces (so string1 gives "{[Surname]}" and "{[FirstName]}", and string2 gives "{Item}" and "{Item2}"). This question is quite good, but I can't get the regex right: poor man's lexer for C#. Thanks for the help!
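
    A hedged sketch of one regex that covers both examples: a brace, an optional square bracket, a name, an optional closing bracket, a closing brace.

      using System;
      using System.Text.RegularExpressions;

      class TokenDemo
      {
          static void Main()
          {
              string input = "{[Surname]}, some text... {[FirstName]} {Item}foo.{Item2}bar";
              // \{\[?\w+\]?\} -- '{', optional '[', word chars, optional ']', '}'
              foreach (Match m in Regex.Matches(input, @"\{\[?\w+\]?\}"))
                  Console.WriteLine(m.Value);
              // {[Surname]}  {[FirstName]}  {Item}  {Item2}
          }
      }

    It is permissive (it would also accept a lone "{[Item}"), which may or may not matter here.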

    Read the article

  • How to read values from a file (tokenizer)

    - by user69514
    I have a file in which each line contains two numbers. The problem is that the two numbers are separated by whitespace, but it can be any number of blank spaces: one, two, or more. I want to read each line and store each of the two numbers in a variable, but I'm not sure how to tokenize it. I.e.:

      1 5
      3 2
      5 6
      3 4
      83 54
      23 23
      32 88
      8 203
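
    A minimal sketch using java.util.Scanner (the file name is illustrative): nextInt skips any run of whitespace by default, so the variable spacing costs nothing.

      import java.io.File;
      import java.io.IOException;
      import java.util.Scanner;

      public class ReadPairs {
          public static void main(String[] args) throws IOException {
              Scanner in = new Scanner(new File("numbers.txt"));
              while (in.hasNextInt()) {
                  int first = in.nextInt();   // whitespace of any length is skipped
                  int second = in.nextInt();
                  System.out.println(first + ", " + second);
              }
              in.close();
          }
      }

    String.split("\\s+") on each trimmed line is the usual alternative if the lines must be processed one at a time.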

    Read the article

  • Splitting input into substrings in PIG (Hadoop)

    - by Niels Basjes
    Assume I have the following input in Pig:

      some

    And I would like to convert that into:

      s
      so
      som
      some

    I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries. So can "Pig Latin" do this, or is this something that requires a Java class?
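
    As far as I know this does call for a UDF; a hedged sketch against the classic org.apache.pig EvalFunc API (class and jar names are illustrative):

      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.BagFactory;
      import org.apache.pig.data.DataBag;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.data.TupleFactory;

      // Returns a bag of all prefixes of the input chararray:
      // Prefixes('some') -> {(s),(so),(som),(some)}
      public class Prefixes extends EvalFunc<DataBag> {
          private static final BagFactory bags = BagFactory.getInstance();
          private static final TupleFactory tuples = TupleFactory.getInstance();

          @Override
          public DataBag exec(Tuple input) throws IOException {
              if (input == null || input.size() == 0) return null;
              String s = (String) input.get(0);
              DataBag bag = bags.newDefaultBag();
              for (int i = 1; i <= s.length(); i++) {
                  Tuple t = tuples.newTuple(1);
                  t.set(0, s.substring(0, i));
                  bag.add(t);
              }
              return bag;
          }
      }

    In the script it would then look something like: REGISTER myudfs.jar; B = FOREACH A GENERATE FLATTEN(Prefixes(word));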

    Read the article

  • Is there a tokenizer for a cpp file

    - by AJ
    I have a cpp file with a huge class implementation, and now I have to modify the source file itself. For this, is there a library/API/tool that will tokenize this file for me and give me one token each time I request it? My requirement is as below:

      OpenCPPFile()
      While (!EOF)
          token = GetNextToken();
          process something based on this token
      EndWhile

    I am happy now. Regards, AJ
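
    One real option is libclang, which exposes exactly this kind of token stream through its C API; a hedged sketch (the file name and parse flags are illustrative, and error checking is elided):

      #include <clang-c/Index.h>
      #include <cstdio>

      int main() {
          CXIndex index = clang_createIndex(0, 0);
          CXTranslationUnit tu = clang_parseTranslationUnit(
              index, "huge_class.cpp", nullptr, 0, nullptr, 0,
              CXTranslationUnit_None);

          // Tokenize the extent of the whole translation unit.
          CXSourceRange range =
              clang_getCursorExtent(clang_getTranslationUnitCursor(tu));
          CXToken *tokens = nullptr;
          unsigned count = 0;
          clang_tokenize(tu, range, &tokens, &count);

          for (unsigned i = 0; i < count; ++i) {
              CXString spelling = clang_getTokenSpelling(tu, tokens[i]);
              printf("%s\n", clang_getCString(spelling));
              clang_disposeString(spelling);
          }

          clang_disposeTokens(tu, tokens, count);
          clang_disposeTranslationUnit(tu);
          clang_disposeIndex(index);
      }

    Boost.Wave is another library built for C++ token iteration; for this requirement either should work.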

    Read the article

  • Evaluating mathematical expressions in Python

    - by vander
    Hi, I want to tokenize a given mathematical expression into a binary tree like this:

      ((3 + 4 - 1) * 5 + 6 * -7) / 2

                  '/'
                 /   \
                +     2
              /   \
             *     *
            / \   / \
           -   5 6  -7
          / \
         +   1
        / \
       3   4

    Is there any pure Python way to do this? Like passing it as a string to Python and getting back a tree like the one above. Thanks.
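
    Pure Python can do this via the standard ast module, which parses the expression into exactly this kind of tree; a minimal sketch (ast.Num is the node name on Pythons of this era; newer versions call it ast.Constant):

      import ast

      expr = "((3 + 4 - 1) * 5 + 6 * -7) / 2"
      tree = ast.parse(expr, mode="eval").body   # root ast.BinOp node

      def show(node, depth=0):
          # Print one node per line, indented by tree depth.
          pad = "  " * depth
          if isinstance(node, ast.BinOp):
              print(pad + type(node.op).__name__)   # Add, Sub, Mult, Div
              show(node.left, depth + 1)
              show(node.right, depth + 1)
          elif isinstance(node, ast.UnaryOp):       # covers the -7
              print(pad + type(node.op).__name__)   # USub
              show(node.operand, depth + 1)
          elif isinstance(node, ast.Num):
              print(pad + str(node.n))

      show(tree)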

    Read the article

  • Antlr Lexer Quoted String Problem

    - by Loki
    I'm trying to build a lexer to tokenize lone words and quoted strings. I got the following:

      STRING : QUOTE (options {greedy=false;} : . )* QUOTE ;
      WS     : SPACE+ { $channel = HIDDEN; } ;
      WORD   : ~(QUOTE|SPACE)+ ;

    For the corner cases, it needs to parse:

      "string" word1" word2

    as three tokens: "string" as a STRING, and word1" and word2 as WORDs. Basically, if there is a dangling last quote, it needs to be part of the WORD where it is; if the quote is surrounded by whitespace, it should be a WORD. I tried this rule for WORD, without success:

      WORD : ~(QUOTE|SPACE)+
           | (~(QUOTE|SPACE)* QUOTE ~QUOTE*)=> ~(QUOTE|SPACE)* QUOTE ~(QUOTE|SPACE)*
           ;

    Read the article

  • Resulting .exe from PyInstaller with wxPython crashing

    - by Helgi Hrafn Gunnarsson
    I'm trying to compile a very simple wxPython script into an executable by using PyInstaller on Windows Vista. The Python script is nothing but a Hello World in wxPython; I'm trying to get that up and running as a Windows executable before I add any of the features that the program needs to have. But I'm already stuck. I've jumped through some hoops in regards to MSVCR90.DLL, MSVCP90.DLL and MSVCPM90.DLL, which I ended up copying from my Visual Studio installation (C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\x86\Microsoft.VC90.CRT). According to the instructions for PyInstaller, I run:

      Command: Configure.py
      Output:
        I: computing EXE_dependencies
        I: Finding TCL/TK...
        I: could not find TCL/TK
        I: testing for Zlib...
        I: ... Zlib available
        I: Testing for ability to set icons, version resources...
        I: ... resource update available
        I: Testing for Unicode support...
        I: ... Unicode available
        I: testing for UPX...
        I: ...UPX available
        I: computing PYZ dependencies...

    So far, so good. I continue:

      Command: Makespec.py -F guitest.py
      Output:
        wrote C:\Code\PromoUSB\guitest.spec
        now run Build.py to build the executable

    Then there's the final command:

      Command: Build.py guitest.spec
      Output:
        checking Analysis
        building Analysis because out0.toc non existent
        running Analysis out0.toc
        Analyzing: C:\Python26\pyinstaller-1.3\support\_mountzlib.py
        Analyzing: C:\Python26\pyinstaller-1.3\support\useUnicode.py
        Analyzing: guitest.py
        Warnings written to C:\Code\PromoUSB\warnguitest.txt
        checking PYZ
        rebuilding out1.toc because out1.pyz is missing
        building PYZ out1.toc
        checking PKG
        rebuilding out3.toc because out3.pkg is missing
        building PKG out3.pkg
        checking ELFEXE
        rebuilding out2.toc because guitest.exe missing
        building ELFEXE out2.toc

    I get the resulting guitest.exe file, but upon execution it "simply crashes", with no debug info; it's just one of those standard Windows Vista crashes. The script itself, guitest.py, runs just fine by itself; it only crashes as an executable, and I'm completely lost. I don't even know what to look for, since nothing I've tried has returned any relevant results. Another file, warnguitest.txt, is generated as a result of the compilation process. Here are its contents:

      W: no module named posix (conditional import by os)
      W: no module named optik.__all__ (top-level import by optparse)
      W: no module named readline (delayed, conditional import by cmd)
      W: no module named readline (delayed import by pdb)
      W: no module named pwd (delayed, conditional import by posixpath)
      W: no module named org (top-level import by pickle)
      W: no module named posix (delayed, conditional import by iu)
      W: no module named fcntl (conditional import by subprocess)
      W: no module named org (top-level import by copy)
      W: no module named _emx_link (conditional import by os)
      W: no module named optik.__version__ (top-level import by optparse)
      W: no module named fcntl (top-level import by tempfile)
      W: __all__ is built strangely at line 0 - collections (C:\Python26\lib\collections.pyc)
      W: delayed exec statement detected at line 0 - collections (C:\Python26\lib\collections.pyc)
      W: delayed conditional __import__ hack detected at line 0 - doctest (C:\Python26\lib\doctest.pyc)
      W: delayed exec statement detected at line 0 - doctest (C:\Python26\lib\doctest.pyc)
      W: delayed conditional __import__ hack detected at line 0 - doctest (C:\Python26\lib\doctest.pyc)
      W: delayed __import__ hack detected at line 0 - encodings (C:\Python26\lib\encodings\__init__.pyc)
      W: __all__ is built strangely at line 0 - optparse (C:\Python26\pyinstaller-1.3\optparse.pyc)
      W: __all__ is built strangely at line 0 - dis (C:\Python26\lib\dis.pyc)
      W: delayed eval hack detected at line 0 - os (C:\Python26\lib\os.pyc)
      W: __all__ is built strangely at line 0 - __future__ (C:\Python26\lib\__future__.pyc)
      W: delayed conditional __import__ hack detected at line 0 - unittest (C:\Python26\lib\unittest.pyc)
      W: delayed conditional __import__ hack detected at line 0 - unittest (C:\Python26\lib\unittest.pyc)
      W: __all__ is built strangely at line 0 - tokenize (C:\Python26\lib\tokenize.pyc)
      W: __all__ is built strangely at line 0 - wx (C:\Python26\lib\site-packages\wx-2.8-msw-unicode\wx\__init__.pyc)
      W: __all__ is built strangely at line 0 - wx (C:\Python26\lib\site-packages\wx-2.8-msw-unicode\wx\__init__.pyc)
      W: delayed exec statement detected at line 0 - bdb (C:\Python26\lib\bdb.pyc)
      W: delayed eval hack detected at line 0 - bdb (C:\Python26\lib\bdb.pyc)
      W: delayed eval hack detected at line 0 - bdb (C:\Python26\lib\bdb.pyc)
      W: delayed __import__ hack detected at line 0 - pickle (C:\Python26\lib\pickle.pyc)
      W: delayed __import__ hack detected at line 0 - pickle (C:\Python26\lib\pickle.pyc)
      W: delayed conditional exec statement detected at line 0 - iu (C:\Python26\pyinstaller-1.3\iu.pyc)
      W: delayed conditional exec statement detected at line 0 - iu (C:\Python26\pyinstaller-1.3\iu.pyc)
      W: delayed eval hack detected at line 0 - gettext (C:\Python26\lib\gettext.pyc)
      W: delayed __import__ hack detected at line 0 - optik.option_parser (C:\Python26\pyinstaller-1.3\optik\option_parser.pyc)
      W: delayed conditional eval hack detected at line 0 - warnings (C:\Python26\lib\warnings.pyc)
      W: delayed conditional __import__ hack detected at line 0 - warnings (C:\Python26\lib\warnings.pyc)
      W: __all__ is built strangely at line 0 - optik (C:\Python26\pyinstaller-1.3\optik\__init__.pyc)
      W: delayed exec statement detected at line 0 - pdb (C:\Python26\lib\pdb.pyc)
      W: delayed conditional eval hack detected at line 0 - pdb (C:\Python26\lib\pdb.pyc)
      W: delayed eval hack detected at line 0 - pdb (C:\Python26\lib\pdb.pyc)
      W: delayed conditional eval hack detected at line 0 - pdb (C:\Python26\lib\pdb.pyc)
      W: delayed eval hack detected at line 0 - pdb (C:\Python26\lib\pdb.pyc)

    I don't know what to make of any of that. Again, my searches have been fruitless.

    Read the article

1 2 3  | Next Page >