nlp - Page 3 - Developer IT

Determining whether values can potentially match a regular expression, given more input

- by Andreas Grech

I am currently writing an application in JavaScript where I'm matching input to regular expressions, but I also need to find a way how to match strings to parts of the regular expressions. For example: var invalid = "x", potentially = "g", valid = "ggg", gReg = /^ggg$/; gReg.test(invalid); //returns false (correct) gReg.test(valid); //returns true (correct) Now I need to find a way to somehow determine that the value of the potentially variable doesn't exactly match the /^ggg$/ expression, BUT with more input, it potentially can! So for example in this case, the potentially variable is g, but if two more g's are appended to it, it will match the regular expression /^ggg$/ But in the case of invalid, it can never match the /^ggg$/ expression, no matter how many characters you append to it. So how can I determine if a string has or doesn't have potential to match a particular regular expression?

Read the article

Naive Bayesian for Topic detection using "Bag of Words" approach

- by AlgoMan

I am trying to implement a naive bayseian approach to find the topic of a given document or stream of words. Is there are Naive Bayesian approach that i might be able to look up for this ? Also, i am trying to improve my dictionary as i go along. Initially, i have a bunch of words that map to a topics (hard-coded). Depending on the occurrence of the words other than the ones that are already mapped. And depending on the occurrences of these words i want to add them to the mappings, hence improving and learning about new words that map to topic. And also changing the probabilities of words. How should i go about doing this ? Is my approach the right one ? Which programming language would be best suited for the implementation ?

Read the article

Writing annotataion schemas for Callisto

- by Ken Bloom

Does anybody know where I can find documentation on how to write annotation schemas for Callisto? I'm looking to write something a little more complicated than I can generate from a DTD -- that only gives me the ability to tag different kinds of text mentions. I'm looking to create a schema that represents a single type of relationship between five or six different kinds of textual mentions (and some of these types of mentions have attributes that I need to assign values to), and possibly having a second type of relationship between the first two instances of the first type of relationship. (Alternatively, does anybody know of any software that would be better for this kind of schema? I've been looking at WordFreak, but it's a little clumsy, and it doesn't support attributes on its textual mentions.)

Read the article

Searching text for geonames

- by Vojtech R.

Hi, which part of huge package nltk I must study and use, if I need mark geonames in text?

Read the article

How to extract common / significant phrases from a series of text entries

- by arronsky

I have a series of text items- raw HTML from a MYSQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com, that shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try the hamburger" (in 44 reviews) e.g., the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/ I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.

Read the article

Detecting syllables in a word

- by user50705

I need to find a fairly efficient way to detect syllables in a word. E.g., invisible - in-vi-sib-le There are some syllabification rules that could be used: V CV VC CVC CCV CCCV CVCC *where V is a vowel and C is a consonant. e.g., pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC) I've tried few methods, among which were using regex (which helps only if you want to count syllables) or hard coded rule definition (a brute force approach which proves to be very inefficient) and finally using a finite state automata (which did not result with anything useful). The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis. I would appreciate if one could give me tips on an alternate way to solve this problem besides my previous approaches. I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

Read the article

Dependency parsing

- by C.

Hi I particularly like the transduce feature offered by agfl in their EP4IR http://www.agfl.cs.ru.nl/EP4IR/english.html The download page is here: http://www.agfl.cs.ru.nl/download.html Is there any way i can make use of this in a c# program? Do I need to convert classes to c#? Thanks :)

Read the article

Theory: "Lexical Encoding"

- by _ande_turner_

I am using the term "Lexical Encoding" for my lack of a better one. A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence. // An simplified example of a "Lexical Encoding" String sentence = "How are you today?"; int[] sentence = { 93, 22, 14, 330, QUERY }; In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark. Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure. What are other examples of "Lexical Encoding" techniques? If you were interested in where the word-usage statistics come from : http://www.wordcount.org

Read the article

Python/PyParsing: Difficulty with setResultsName

- by Rosarch

I think I'm making a mistake in how I call setResultsName(): from pyparsing import * DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("Dept Code") COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("Course Number") COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0])) course = DEPT_CODE + COURSE_NUMBER course.setResultsName("course") statement = course From IDLE: >>> myparser import * >>> statement.parseString("CS 2110") (['CS', 2110], {'Dept Code': [('CS', 0)], 'Course Number': [(2110, 1)]}) The output I hope for: >>> myparser import * >>> statement.parseString("CS 2110") (['CS', 2110], {'Course': ['CS', 2110], 'Dept Code': [('CS', 0)], 'Course Number': [(2110, 1)]}) Does setResultsName() only work for terminals?

Read the article

Java text classification problem

- by yox

Hello, I have a set of Books objects, classs Book is defined as following : Class Book{ String title; ArrayList<tags> taglist; } Where title is the title of the book, example : Javascript for dummies. and taglist is a list of tags for our example : Javascript, jquery, "web dev", .. As I said a have a set of books talking about different things : IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it.. I have to classify automaticaly those books into separated sets by topic, example : IT BOOKS : Java for dummies Javascript for dummies Learn flash in 30 days C++ programming HISTORY BOOKS : World wars America in 1960 Martin luther king's life BIOLOGY BOOKS : .... Do you guys know a classification algorithm/method to apply for that kind of problems ? A solution is to use an external API to define the category of the text, but the problem here is that books are in different languages : french, spanish, english ..

Read the article

How to get parent node in Stanford's JavaNLP?

- by roddik

Hello. Suppose I have such chunk of a sentence: (NP (NP (DT A) (JJ single) (NN page)) (PP (IN in) (NP (DT a) (NN wiki) (NN website)))) At a certain moment of time I have a reference to (JJ single) and I want to get the NP node binding A single page. If I get it right, that NP is the parent of the node, A and page are its siblings and it has no children (?). When I try to use the .parent() method of a tree, I always get null. The API says that's because the implementation doesn't know how to determine the parent node. Another method of interest is .ancestor(int height, Tree root), but I don't know how to get the root of the node. In both cases, since the parser knows how to indent and group trees, it must know the "parent" tree, right? How can I get it? Thanks

Read the article

Using Markov models to convert all caps to mixed case and related problems

- by hippietrail

I've been thinking about using Markov techniques to restore missing information to natural language text. Restore mixed case to text in all caps Restore accents / diacritics to languages which should have them but have been converted to plain ASCII Convert rough phonetic transcriptions back into native alphabets That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context. I can use Wiktionary as a dictionary and Wikipedia as a corpus using n-grams and Markov chains to resolve the ambiguities. Am I on the right track? Are there already some services, libraries, or tools for this sort of thing? Examples GEORGE LOST HIS SIM CARD IN THE BUSH - George lost his SIM card in the bush tantot il rit a gorge deployee - tantôt il rit à gorge déployée

Read the article

Ngram IDF smoothing

- by adi92

I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents. I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others The problem that I am running into is that some (3,4)-grams in my data which have super-high idf actually consist of component unigrams and bigrams which have really low idf.. For example, "you've never tried" has a very high idf, while each of the component unigrams have very low idf.. I need to come up with a function that can take in document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest. If I were dealing with probabilities, I would try interpolation or backoff models.. I am not sure what assumptions/intuitions those models leverage to perform well, and so how well they would do for IDF scores. Anybody has any better ideas?

Read the article

Compose synthetic English phrase that would contain 160 bits of recoverable information

- by Alexander Gladysh

I have 160 bits of random data. Just for fun, I want to generate pseudo-English phrase to "store" this information in. I want to be able to recover this information from the phrase. Note: This is not a security question, I don't care if someone else will be able to recover the information or even detect that it is there or not. Criteria for better phrases, from most important to the least: Short Unique Natural-looking The current approach, suggested here: Take three lists of 1024 nouns, verbs and adjectives each (picking most popular ones). Generate a phrase by the following pattern, reading 20 bits for each word: Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb. Now, this seems to be a good approach, but the phrase is a bit too long and a bit too dull. I have found a corpus of words here (Part of Speech Database). After some ad-hoc filtering, I calculated that this corpus contains, approximately 50690 usable adjectives 123585 nouns 15301 verbs This allows me to use up to 16 bits per adjective (actually 16.9, but I can't figure how to use fractional bits) 15 bits per noun 13 bits per verb For noun-verb-adjective-verb pattern this gives 57 bits per "sentence" in phrase. This means that, if I'll use all words I can get from this corpus, I can generate three sentences instead of four (160 / 57 ˜ 2.8). Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb. Still a bit too long and dull. Any hints how can I improve it? What I see that I can try: Try to compress my data somehow before encoding. But since the data is completely random, only some phrases would be shorter (and, I guess, not by much). Improve phrase pattern, so it would look better. Use several patterns, using the first word in phrase to somehow indicate for future decoding which pattern was used. (For example, use the last letter or even the length of the word.) Pick pattern according to the first bytes of the data. ...I'm not that good with English to come up with better phrase patterns. Any suggestions? Use more linguistics in the pattern. Different tenses etc. ...I guess, I would need much better word corpus than I have now for that. Any hints where can I get a suitable one?

Read the article

PyParsing: What does Combine() do?

- by Rosarch

What is the difference between: foo = TOKEN1 + TOKEN2 and foo = Combine(TOKEN1 + TOKEN2) Thanks.

Read the article

PyParsing: Not all tokens passed to setParseAction()

- by Rosarch

I'm parsing sentences like "CS 2110 or INFO 3300". I would like to output a format like: [[("CS" 2110)], [("INFO", 3300)]] To do this, I thought I could use setParseAction(). However, the print statements in statementParse() suggest that only the last tokens are actually passed: >>> statement.parseString("CS 2110 or INFO 3300") Match [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] at loc 7(1,8) string CS 2110 or INFO 3300 loc: 7 tokens: ['INFO', 3300] Matched [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] -> ['INFO', 3300] (['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]}) I expected all the tokens to be passed, but it's only ['INFO', 3300]. Am I doing something wrong? Or is there another way that I can produce the desired output? Here is the pyparsing code: from pyparsing import * def statementParse(str, location, tokens): print "string %s" % str print "loc: %s " % location print "tokens: %s" % tokens DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode") COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber") OR_CONJ = Suppress("or") COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0])) course = DEPT_CODE + COURSE_NUMBER.setResultsName("Course") statement = course + Optional(OR_CONJ + course).setParseAction(statementParse).setDebug()

Read the article

Algorithm to match natural text in mail

- by snøreven

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing. example: Hi tom, last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. list item 2 list item 3 list item 3 Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit regards, K. ---line-of-funny-characters-####### example inc. 33 evil street, london mobile: 00 234534/234345 Ideally the algorithm would match only the bold parts. Is there any recommended approach - or are there even existing algorithms for that problem? Should I try approximate regular expressions or more statistical stuff based on number of punctation marks, length and so on?

Read the article

Hierarchy of meaning

- by asldkncvas

I am looking for a method to build a hierarchy of words. Background: I am a "amateur" natural language processing enthusiast and right now one of the problems that I am interested in is determining the hierarchy of word semantics from a group of words. For example, if I have the set which contains a "super" representation of others, i.e. [cat, dog, monkey, animal, bird, ... ] I am interested to use any technique which would allow me to extract the word 'animal' which has the most meaningful and accurate representation of the other words inside this set. Note: they are NOT the same in meaning. cat != dog != monkey != animal BUT cat is a subset of animal and dog is a subset of animal. I know by now a lot of you will be telling me to use wordnet. Well, I will try to but I am actually interested in doing a very domain specific area which WordNet doesn't apply because: 1) Most words are not found in Wordnet 2) All the words are in another language; translation is possible but is to limited effect. another example would be: [ noise reduction, focal length, flash, functionality, .. ] so functionality includes everything in this set. I have also tried crawling wikipedia pages and applying some techniques on td-idf etc but wikipedia pages doesn't really do much either. Can someone possibly enlighten me as to what direction my research should go towards? (I could use anything)

Read the article

are there any c# libraries for Named Entity Recognition?

- by Taz

I am looking for any free libraries for Named Entity Recognition in c# or any other .net language.

Read the article

Ideas for designing an automated content tagging system needed

- by Benjamin Smith

I am currently designing a website that amongst other is required to display and organise small amounts of text content (mainly quotes, article stubs, etc.). I currently have a database with 250,000+ items and need to come up with a method of tagging each item with relevant tags which will eventually allow for easy searching/browsing of the content for users. A very simplistic idea I have (and one that I believe is employed by some sites that I have been looking to for inspiration (http://www.brainyquote.com/quotes/topics.html)), is to simply search the database for certain words or phrases and use these words as tags for the content. This can easily be extended so that if for example a user wanted to show all items with a theme of love then I would just return a list of items with words and phrases relating to this theme. This would not be hard to implement but does not provide very good results. For example if I were to search for the month 'May' in the database with the aim of then classifying the items returned as realting to the topic of Spring then I would get back all occurrences of the word May, regardless of the semantic meaning. Another shortcoming of this method is that I believe it would be quite hard to automate the process to any large scale. What I really require is a library that can take an item, break it down and analyse the semantic meaning and also return a list of tags that would correctly classify the item. I know this is a lot to ask and I have a feeling I will end up reverting to the aforementioned method but I just thought I should ask if anyone knew of any pre-existing solution. I think that as the items in the database are short then it is probably quite a hard task to analyse any meaning from them however I may be mistaken. Another path to possibly go down would be to use something like amazon turk to outsource the task which may produce good results but would be expensive. Eventually I would like users to be able to (and want to!) tag content and to vote for the most relevant tags, possibly using a gameification mechanic as motivation however this is some way down the line. A temporary fix may be the best thing if this were the route I decided to go down as I could use the rough results I got as the starting point for a more in depth solution. If you've read this far, thanks for sticking with me, I know I'm spitballing but any input would be really helpful. Thanks.

Read the article

How does Amazon's Statistically Improbable Phrases work?

- by ??iu

How does something like Statistically Improbable Phrases work? According to amazon: Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book. SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements. For instance, for Joel's first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules One interesting complication is that these are phrases of either 2 or 3 words. This makes things a little more interesting because these phrases can overlap with or contain each other.

Read the article

Natural Language Processing in Ruby

- by Joey Robert

I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby? Similar to http://stackoverflow.com/questions/870460/java-is-there-a-good-natural-language-processing-library but for Ruby. I'd prefer something very general, but any leads are appreciated!

Read the article

Algorithm to classify a list of products?

- by Martin

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives. Seagate Hard Drive 500Go Seagate Hard Drive 120Go for laptop Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive New and shinny 500Go hard drive from Seagate Seagate Barracuda 7200.12 Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail For a human being, the hard drives 3 and 5 are the same. We could go a little bit further and suppose that the products 1, 3, 4 and 5 are the same and put in other categories the product 2 and 6. We have a huge list of products that I would like to classify. Does anybody have an idea of what would be the best algorithm to do such thing. Any suggestions? I though of a Bayesian classifier but I am not sure if it is the best choice. Any help would be appreciated! Thanks.

Read the article

Python: How best to parse a simple grammar?

- by Rosarch

Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale. I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them. From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.) Some sample inputs and outputs: "CS 2110" => ("CS", 2110) # 0 "CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1 "CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1 "CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2 "MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3 If the entire description is just a course, it is output directly. If the courses are conjoined ("and"), they are all output in the same list If the course are disjoined ("or"), they are in separate lists Here, we have both "and" and "or". One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3. What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates: def p_course(p): 'course : DEPT_CODE COURSE_NUMBER' p[0] = (p[1], int(p[2])) With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done. def addCourse(self, str, location, tokens): self.result.append((tokens[0][0], tokens[0][1])) def makeCourseList(self, str, location, tokens): dept = tokens[0][0] new_tokens = [(dept, tokens[0][1])] new_tokens.extend((dept, tok) for tok in tokens[1:]) self.result.append(new_tokens) For instance, to handle "or" cases: def __init__(self): self.result = [] # ... self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses) def disjunctionCourses(self, str, location, tokens): if len(tokens) == 1: return tokens print "disjunction tokens: %s" % tokens How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted... What's a better way to approach this problem?

Read the article

How to write efficient code for extracting Noun phrases?

- by Arun Abraham

I am trying to extract phrases using rules such as the ones mentioned below on text which has been POS tagged 1) NNP - NNP (- indicates followed by) 2) NNP - CC - NNP 3) VP - NP etc.. I have written code in this manner, Can someone tell me how i can do in a better manner. List<String> nounPhrases = new ArrayList<String>(); for (List<HasWord> sentence : documentPreprocessor) { //System.out.println(sentence.toString()); System.out.println(Sentence.listToString(sentence, false)); List<TaggedWord> tSentence = tagger.tagSentence(sentence); String lastTag = null, lastWord = null; for (TaggedWord taggedWord : tSentence) { if (lastTag != null && taggedWord.tag().equalsIgnoreCase("NNP") && lastTag.equalsIgnoreCase("NNP")) { nounPhrases.add(taggedWord.word() + " " + lastWord); //System.out.println(taggedWord.word() + " " + lastWord); } lastTag = taggedWord.tag(); lastWord = taggedWord.word(); } } In the above code, i have done only for NNP followed by NNP extraction, how can i generalise it so that i can add other rules too. I know that there are libraries available for doing this , but wanted to do this manually.

Search Results

Search found 128 results on 6 pages for 'nlp'.

Page 3/6 | < Previous Page | 1 2 3 4 5 6 | Next Page >

- by Andreas Grech

- by AlgoMan

- by Ken Bloom

- by Vojtech R.

- by arronsky

- by user50705

- by C.

- by _ande_turner_

- by Rosarch

- by yox

- by roddik

- by hippietrail

- by adi92

- by Alexander Gladysh

- by Rosarch

- by Rosarch

- by snøreven

- by asldkncvas

- by Taz

- by Benjamin Smith

- by ??iu

- by Joey Robert

- by Martin

- by Rosarch

- by Arun Abraham

< Previous Page | 1 2 3 4 5 6 | Next Page >