Search Results

Search found 128 results on 6 pages for 'nlp'.

Page 2/6 | < Previous Page | 1 2 3 4 5 6  | Next Page >

  • Algorithm for Negating Sentences

    - by Kevin Dolan
    I was wondering if anyone was familiar with any attempts at algorithmic sentence negation. For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad". Obviously, accomplishing this with a high degree of accuracy would probably be beyond the scope of current NLP, but I'm sure there has been some work on the subject. If anybody knows of any work, care to point me to some papers?
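
    As a rough illustration of the rule-based baseline such work usually starts from, here is a sketch (mine, not drawn from any paper) that simply inserts "not" after the first auxiliary or copula. It assumes NLTK with its tokenizer models installed; the auxiliary list is a placeholder and far from complete.

        import nltk  # assumes NLTK's tokenizer models (punkt) have been downloaded

        AUXILIARIES = {"is", "are", "was", "were", "am", "be", "can", "could", "will",
                       "would", "should", "may", "might", "must", "do", "does", "did"}

        def negate(sentence):
            # Very naive negation: insert 'not' after the first auxiliary/copula.
            tokens = nltk.word_tokenize(sentence)
            for i, token in enumerate(tokens):
                if token.lower() in AUXILIARIES:
                    return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
            return None  # no auxiliary found; real negation needs do-support, antonyms, etc.

        print(negate("This book is good"))  # This book is not good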

    Read the article

  • How to program a system for answering reading comprehension questions in English

    - by michael123
    Hi, I have to do a study on reading comprehension in English. My work is OK, but there is a part from the natural language processing (NLP) area that I have to use. I want some help with QA systems: how to answer reading comprehension questions automatically. I have a simple system that I got from this website, http://www.cs.utah.edu/contest/2003/ ; it is a simple system in Java, but it did not work for me. I tried to load the files from the Remedia dataset, which contains the reading comprehension stories, but got no results. After running this system, I have to develop it further with current techniques such as rule-based or pattern-matching approaches, or combine it with simple named entity recognition. How do I do that, and which of them is better to combine with the QA system? Thank you.

    Read the article

  • Building dictionary of words from large text

    - by LiorH
    I have a text file containing posts in English/Italian. I would like to read the posts into a data matrix so that each row represents a post and each column a word. The cells in the matrix are the counts of how many times each word appears in the post. The dictionary should consist of all the words in the whole file or a non-exhaustive English/Italian dictionary. I know this is a common, essential preprocessing step for NLP. Does anyone know of a tool/project that can perform this task? Someone mentioned Apache Lucene; do you know whether a Lucene index can be serialized to a data structure similar to what I need?
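
    In case it helps to see the shape of the counting step, here is a minimal pure-Python sketch (no Lucene involved); real tools handle tokenization, stop words and scaling far better, so treat this only as an illustration of the matrix being described.

        from collections import Counter
        import re

        posts = ["the cat sat on the mat", "il gatto dorme sul tappeto", "the cat is happy"]

        # Tokenize (very crudely) and build the dictionary: one column index per word.
        tokenized = [re.findall(r"\w+", post.lower(), re.UNICODE) for post in posts]
        vocabulary = sorted(set(word for tokens in tokenized for word in tokens))

        # One row per post, one column per word, cell = count of the word in that post.
        matrix = []
        for tokens in tokenized:
            counts = Counter(tokens)
            matrix.append([counts[word] for word in vocabulary])

        print(vocabulary)
        print(matrix)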

    Read the article

  • How to get logical parts of a sentence with Java?

    - by roddik
    Hello. Let's say there is a sentence: "On March 1, he was born." Changing it to "He was born on March 1." doesn't break the sense of the sentence, and it is still valid. Shuffling the words in any other way would produce weird or invalid sentences. So basically, I'm talking about parts of the sentence which make the information more specific, but whose removal doesn't break the whole sentence. Is there any NLP library that can identify such parts?
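
    The question asks about Java, but to show the kind of output a chunker gives, here is a sketch in Python/NLTK that pulls out prepositional phrases like "on March 1" as movable/removable modifiers. The chunk grammar is my own rough guess, not a tuned one, and it assumes NLTK's default POS tagger model is installed.

        import nltk  # assumes NLTK's default POS tagger model has been downloaded

        sentence = "He was born on March 1 ."
        tagged = nltk.pos_tag(sentence.split())

        # Toy chunk grammar: a prepositional phrase is a preposition followed by
        # an optional determiner and one or more nouns/proper nouns/numbers.
        chunker = nltk.RegexpParser("PP: {<IN><DT>?<NNP|NN|CD>+}")
        tree = chunker.parse(tagged)

        for subtree in tree.subtrees(filter=lambda t: t.label() == "PP"):
            print(" ".join(word for word, tag in subtree.leaves()))
        # expected to print something like: on March 1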

    Read the article

  • Java or Python distributed compute job (on a student budget)?

    - by midget_sadhu
    I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which I do not have root access and on which I have only 1G of user space. I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1G user space cap. I have been looking into a couple of Python-based options (as I'd rather use NLTK instead of Java's LingPipe if I can help it), and the distributed compute options seem to be IPython and Disco. After my Hadoop experience, I am trying to make sure I make an informed choice; any help on what might be more appropriate would be greatly appreciated. Amazon's EC2 etc. is not really an option, as I have next to no budget.
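
    For what it's worth, if the per-record work really is embarrassingly parallel, a single box can get surprisingly far before Hadoop or Disco becomes necessary: stream the file straight off the USB drive and fan batches of records out to worker processes. A rough standard-library-only sketch; the file name and the work done per line are placeholders.

        from itertools import islice
        from multiprocessing import Pool

        def process(line):
            # placeholder for the real NLP work done on one record
            return len(line.split())

        def batches(path, size=10000):
            # stream the file in chunks so it never has to fit in memory (or user quota)
            with open(path) as handle:
                while True:
                    batch = list(islice(handle, size))
                    if not batch:
                        break
                    yield batch

        if __name__ == "__main__":
            total = 0
            with Pool(processes=4) as pool:
                for batch in batches("corpus.txt"):  # hypothetical file name
                    total += sum(pool.map(process, batch))
            print(total)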

    Read the article

  • Word list sources

    - by warren
    I am looking for a source of nouns, adverbs, adjectives, and verbs in several languages. I'd like the lists to already be split apart, and not have to go through the OED (and non-English equivalents) by hand re-creating said lists. I don't really care about definitions, and I understand some words can be multiple parts of speech - that's fine - words like "many" could be a noun or adjective, and can appear in both lists. Does anyone here know of such a source? If not, might someone be able to point me in the right direction?
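
    One source worth checking, for English at least: WordNet already stores its entries split by part of speech, and NLTK exposes that directly (WordNet-style resources exist for several other languages too, though coverage varies). A sketch, assuming NLTK's wordnet corpus has been downloaded:

        from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

        nouns = set()
        for synset in wn.all_synsets('n'):  # 'n', 'v', 'a', 'r' = noun, verb, adjective, adverb
            for lemma in synset.lemma_names():
                nouns.add(lemma.replace('_', ' '))

        print(len(nouns))
        print(sorted(nouns)[:10])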

    Read the article

  • Which is better? OpenCyc or ConceptNet?

    - by Daniel Loureiro
    Hi, I'm doing an NLP project where I need to recognise concepts in sentences to find other similar concepts. I do this to infer word valences from a list I already have. I started using WordNet, but it gave many contradictory results; by contradictory results I mean word expansions that had contradictory valences. So now I'm looking into ConceptNet and OpenCyc. I've already implemented ConceptNet and it was all very easy and I love it. The problem is that OpenCyc appears to have a much larger and more logically rigid database, which is important given how many "contradictions" I found in WordNet... but I wouldn't know, because I haven't tried it. Could someone tell me if it's worth going through the (considerable, for me) effort to implement OpenCyc, or is ConceptNet good enough to infer word valences? Are they that different? I'll be happy to explain myself further if needed. Trying to keep it short for now! Thanks!

    Read the article

  • Indexing and Searching Over Word Level Annotation Layers in Lucene

    - by dmcer
    I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:

        Word   POS  Chunk  NER
        =====  ===  =====  ========
        The    DT   NP     Person
        man    NN   NP     Person
        went   VBD  VP     -
        to     TO   PP     -
        the    DT   NP     Location
        store  NN   NP     Location

    I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end users might enter the query as follows: Query: Word=Washington,NER=Person. I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person, followed by the words "arrived at", followed by a word tagged location. Such a query might look like: Query: "NER=Person Word=arrived Word=at NER=Location". What's a good way to go about approaching this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
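
    Whatever Lucene mechanism ends up carrying the annotations (payloads, parallel fields, and so on), one low-tech option is to flatten the layers into a single combined token per word position and query with wildcard or regex terms. A sketch of just the encoding side, in Python, with no Lucene API shown; whether this scales to the full query language is an open question.

        words  = ["The", "man", "went", "to", "the", "store"]
        pos    = ["DT", "NN", "VBD", "TO", "DT", "NN"]
        chunks = ["NP", "NP", "VP", "PP", "NP", "NP"]
        ner    = ["Person", "Person", "-", "-", "Location", "Location"]

        # One combined token per position; index this string in a whitespace-analyzed field.
        combined = " ".join(
            "{0}|POS={1}|Chunk={2}|NER={3}".format(w, p, c, n)
            for w, p, c, n in zip(words, pos, chunks, ner)
        )
        print(combined)
        # A query like Word=Washington,NER=Person then becomes a wildcard term such as
        # Washington|*NER=Person, and the sequential queries become phrase/span queries
        # over these combined tokens.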

    Read the article

  • Getting the text that will be displayed to the user from HTML

    - by gordatron
    Bit of a random one: I want to have a play with some NLP stuff, and I would like to get all the text that will be displayed to the user in a browser from HTML. My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output). If there were a way of inserting a newline or full stop in situations where the content is likely not to continue on, that would be considered an added bonus; e.g. items in a ul or option tag could be separated by full stops (or, to be honest, just ignored). I am working in Java, but would be interested in seeing any code that does this. I can (and will if required) come up with something to do this myself; I just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-). An example of the code I might write, if I do end up doing this, would be to use a SAX parser to find content in p tags, strip it of any span or strong etc. tags, and add a full stop if I hit a div or another p without having had a full stop. Any pointers or suggestions very welcome.
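
    The question is about Java, but since it is the logic that matters, here is a sketch using Python's standard html.parser that keeps only the visible text, drops script/style, and closes each block element with a full stop when one is missing. It is deliberately rough and incomplete, in the spirit of the afternoon hack described above.

        from html.parser import HTMLParser

        BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "tr", "option"}
        SKIP_TAGS = {"script", "style"}

        class TextExtractor(HTMLParser):
            def __init__(self):
                super().__init__()
                self.chunks = []
                self.skip_depth = 0

            def handle_starttag(self, tag, attrs):
                if tag in SKIP_TAGS:
                    self.skip_depth += 1

            def handle_endtag(self, tag):
                if tag in SKIP_TAGS and self.skip_depth:
                    self.skip_depth -= 1
                elif tag in BLOCK_TAGS and self.chunks and not self.chunks[-1].rstrip().endswith("."):
                    self.chunks.append(". ")  # close the block with a full stop if it lacks one

            def handle_data(self, data):
                if not self.skip_depth and data.strip():
                    self.chunks.append(data.strip() + " ")

            def text(self):
                return "".join(self.chunks).strip()

        extractor = TextExtractor()
        extractor.feed("<div><p>First paragraph</p><p>Second one.</p><script>x=1;</script></div>")
        print(extractor.text())  # First paragraph . Second one.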

    Read the article

  • Sentiment analysis with NLTK (Python) for sentences, using sample data or a web service?

    - by Ke
    I am embarking upon an NLP project for sentiment analysis. I have successfully installed NLTK for Python (it seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (let's say several hundred tweets on the subject of the UK election, taken from their web service). I would like to break this up into sentences (or pieces of information no longer than 100 or so characters); I guess I can just do this in Python. Then I want to search through all the sentences for specific instances within each sentence, e.g. "David Cameron". Then I would like to check for positive/negative sentiment in each sentence and count them accordingly. NB: I am not really worried too much about accuracy, because my data sets are large, and not worried too much about sarcasm either. Here are the troubles I am having: all the data sets I can find, e.g. the corpus movie review data that comes with NLTK, aren't in web service format; it looks like these have had some processing done already. As far as I can see the processing (by Stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative, e.g. the polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ . How is this done? (To organise the sentences by sentiment, is it definitely WEKA, or something else?) I am not sure I understand why WEKA and NLTK would be used together; it seems like they do much the same thing. If I'm processing the data with WEKA first to find sentiment, why would I need NLTK? Is it possible to explain why this might be necessary? I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences, rather than using the data samples given in the link? Any help is much appreciated and will save me much hair! Cheers, Ke
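
    On the "is WEKA needed" point: NLTK can train and apply a classifier entirely on its own; WEKA is simply another toolkit. Below is a minimal word-presence Naive Bayes sketch over the movie review corpus that ships with NLTK; swapping in your own (sentence, label) pairs, for instance tweets labelled via the ":)"/":(" trick, is the part you would still have to supply.

        import random
        import nltk
        from nltk.corpus import movie_reviews  # requires: nltk.download('movie_reviews')

        def features(words):
            # bag-of-words presence features: crude, but a standard baseline
            return {word.lower(): True for word in words}

        labeled = [(features(movie_reviews.words(fid)), category)
                   for category in movie_reviews.categories()
                   for fid in movie_reviews.fileids(category)]
        random.shuffle(labeled)

        train, test = labeled[:1500], labeled[1500:]
        classifier = nltk.NaiveBayesClassifier.train(train)

        print(nltk.classify.accuracy(classifier, test))
        print(classifier.classify(features("David Cameron gave a dreadful answer".split())))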

    Read the article

  • Understanding the SemCor corpus structure

    - by Sharmila
    I'm learning NLP. I am currently playing with word sense disambiguation. I'm planning to use the SemCor corpus as training data, but I have trouble understanding the XML structure. I tried googling but did not find any resource describing the content structure of SemCor.

        <s snum="1">
        <wf cmd="ignore" pos="DT">The</wf>
        <wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
        <wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
        <wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
        <wf cmd="ignore" pos="DT">an</wf>
        <wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
        <wf cmd="ignore" pos="IN">of</wf>
        <wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf>
        <wf cmd="ignore" pos="POS">'s</wf>
        <wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
        <wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf>
        <wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf>
        <punc>``</punc>
        <wf cmd="ignore" pos="DT">no</wf>
        <wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf>
        <punc>''</punc>
        <wf cmd="ignore" pos="IN">that</wf>
        <wf cmd="ignore" pos="DT">any</wf>
        <wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf>
        <wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf>
        <punc>.</punc>
        </s>

    I'm assuming wnsn is the 'word sense'. Is that correct? What does the attribute lexsn mean? How does it map to WordNet? What does the attribute pn refer to (third line)? How is the rdf attribute assigned (again third line)? In general, what are the possible attributes?
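
    While waiting for an authoritative description, here is a standard-library sketch of pulling the sense annotations out. As far as I can tell (treat this as an assumption to verify), wnsn is the WordNet sense number and lemma + "%" + lexsn forms the WordNet sense key, which is how a tag maps back to WordNet; NLTK's WordNet interface can, if I recall correctly, resolve such keys via wordnet.lemma_from_key().

        import xml.etree.ElementTree as ET

        snippet = """<s snum="1">
          <wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
          <wf cmd="ignore" pos="DT">an</wf>
          <wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
        </s>"""

        sentence = ET.fromstring(snippet)
        for wf in sentence.iter("wf"):
            if wf.get("cmd") == "done":  # only the sense-tagged words
                sense_key = wf.get("lemma") + "%" + wf.get("lexsn")
                print(wf.text, wf.get("pos"), wf.get("wnsn"), sense_key)
        # said VB 1 say%2:32:00::
        # investigation NN 1 investigation%1:09:00::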

    Read the article

  • Text mining with PHP

    - by garyc40
    Hi, I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is a Naive Bayes classifier or a decision tree. However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org); is there anything like that for PHP? I'm planning to use WEKA as the back end of the web app (by calling WEKA on the command line from within PHP), but it doesn't seem that efficient. Do you have any idea what I should use for this project? Or should I just switch to Python? Thanks

    Read the article

  • Python - pyparsing unicode characters

    - by mgj
    Hi. :) I tried using w = Word(printables), but it isn't working. How should I give the spec for this? 'w' is meant to process Hindi characters (UTF-8). The code specifies the grammar and parses accordingly:

        671.assess :: ????? ::2
        x = number + "." + src + "::" + w + "::" + number + "." + number

    If there are only English characters it works, so the code is correct for the ASCII format, but it does not work for the Unicode format. I mean that the code works when we have something of the form

        671.assess :: ahsaas ::2

    i.e. it parses words in the English format, but I am not sure how to parse and then print characters in the Unicode format. I need this for English-Hindi word alignment. The Python code looks like this:

        # -*- coding: utf-8 -*-
        from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit

        # grammar
        src = Word(printables)
        trans = Word(printables)
        number = Word(nums)
        x = number + "." + src + "::" + trans + "::" + number + "." + number

        # parsing for eng-dict
        efiledata = open('b1aop_or_not_word.txt').read()
        eresults = x.parseString(efiledata)
        edict1 = {}
        edict2 = {}
        counter = 0
        xx = list()
        for result in eresults:
            trans = ""  # translation string
            ew = ""     # english word
            xx = result[0]
            ew = xx[2]
            trans = xx[4]
            edict1 = {ew: trans}
            edict2.update(edict1)
        print len(edict2)  # no of entries in the english dictionary
        print "edict2 has been created"
        print "english dictionary", edict2

        # parsing for hin-dict
        hfiledata = open('b1aop_or_not_word.txt').read()
        hresults = x.scanString(hfiledata)
        hdict1 = {}
        hdict2 = {}
        counter = 0
        for result in hresults:
            trans = ""  # translation string
            hw = ""     # hin word
            xx = result[0]
            hw = xx[2]
            trans = xx[4]
            # print trans
            hdict1 = {trans: hw}
            hdict2.update(hdict1)
        print len(hdict2)  # no of entries in the hindi dictionary
        print "hdict2 has been created"
        print "hindi dictionary", hdict2

        '''
        #######################################################################
        def translate(d, ow, hinlist):
            if ow in d.keys():  # ow = old word, d = dict
                print ow, "exists in the dictionary keys"
                transes = d[ow]
                transes = transes.split()
                print "possible transes for", ow, " = ", transes
                for word in transes:
                    if word in hinlist:
                        print "trans for", ow, " = ", word
                        return word
                return None
            else:
                print ow, "absent"
                return None

        f = open('bidir', 'w')
        #lines = ["'\
        #5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \
        #5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \
        #'"]
        data = open('bi_full_2', 'rb').read()
        lines = data.split('!@#$%')
        loc = 0
        for line in lines:
            eng, hin = [subline.split(' # ') for subline in line.strip('\n').split('\n')]
            for transdict, source, dest in [(edict2, eng, hin), (hdict2, hin, eng)]:
                sourcethings = source[2].split()
                for word in source[1].split():
                    tl = dest[1].split()
                    otherword = translate(transdict, word, tl)
                    loc = source[1].split().index(word)
                    if otherword is not None:
                        otherword = otherword.strip()
                        print word, ' <-> ', otherword, 'meaning=good'
                        if otherword in dest[1].split():
                            print word, ' <-> ', otherword, 'trans=good'
                            sourcethings[loc] = str(dest[1].split().index(otherword) + 1)
                source[2] = ' '.join(sourcethings)
            eng = ' # '.join(eng)
            hin = ' # '.join(hin)
            f.write(eng + '\n' + hin + '\n\n\n')
        f.close()
        '''

    If an example input sentence for the source file is:

        1# 5 # modern markets : confident consumers # 0 0 0 0 0
        1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0
        !@#$%

    the output would look like this:

        1# 5 # modern markets : confident consumers # 1 2 3 4 5
        1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0
        !@#$%

    Output explanation: this achieves bidirectional alignment. It means the first word of English, 'modern', maps to the first word of Hindi, 'AddhUnIk', and vice versa. Here even punctuation characters are taken as words, as they too are an integral part of the bidirectional mapping. Thus, if you observe, the Hindi word '.' has a null alignment and maps to nothing with respect to the English sentence, as the English sentence doesn't have a full stop. The third line in the output is basically a delimiter between sentences when we are working with a number of sentences for which we are trying to achieve bidirectional mapping. What modification should I make for this to work if I have the Hindi sentences in Unicode (UTF-8) format?
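
    Not a definitive answer, but the two usual suspects are (a) the file being read as raw bytes instead of decoded UTF-8 and (b) Word(printables) only covering ASCII. Below is a sketch of both fixes, building the Hindi word class over an explicit Devanagari range; it is written for Python 3, so in the Python 2 code above you would use unichr instead of chr and io.open (or codecs.open) for reading.

        # -*- coding: utf-8 -*-
        import io
        from pyparsing import Word, alphas, nums

        # Word(printables) only knows ASCII; build a character class for the
        # Devanagari block (U+0900..U+097F) so Hindi tokens can be matched.
        devanagari = u"".join(chr(code) for code in range(0x0900, 0x0980))

        number = Word(nums)
        eng = Word(alphas)
        hin = Word(devanagari)
        x = number + u"." + eng + u"::" + hin + u"::" + number + u"." + number

        print(x.parseString(u"671.assess :: एहसास ::2.1"))

        # Decode the file while reading, so pyparsing sees unicode rather than bytes:
        # with io.open('b1aop_or_not_word.txt', encoding='utf-8') as f:
        #     efiledata = f.read()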

    Read the article

  • Algorithm for analyzing text of words

    - by Click Upvote
    I want an algorithm which would create all possible phrases in a block of text. For example, in the text: "My username is click upvote. I have 4k rep on stackoverflow" It would create the following combinations: "My username" "My Username is" "username is click" "is click" "is click upvote" "click upvote" "i have" "i have 4k" "have 4k" .. You get the idea. Basically the point is to get all possible combinations of 'phrases' out of a sentence. Any thoughts for how to best implement this?
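
    What is being described is usually called extracting word n-grams. A minimal sketch that produces every contiguous phrase of two or more words within each sentence:

        import re

        def phrases(text, min_len=2):
            results = []
            for sentence in re.split(r"[.!?]", text):
                words = sentence.split()
                for n in range(min_len, len(words) + 1):  # phrase length
                    for i in range(len(words) - n + 1):   # start position
                        results.append(" ".join(words[i:i + n]))
            return results

        text = "My username is click upvote. I have 4k rep on stackoverflow"
        for phrase in phrases(text):
            print(phrase)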

    Read the article

  • Looking for a good semantic parser for the Russian language.

    - by Gregory Gelfond
    Does anyone know of a semantic parser for the Russian language? I've attempted to configure the link parser available from the link-grammar site, but to no avail. I'm hoping for a system that can run on the Mac and generate either a Prolog or Lisp-like representation of the parse tree (but XML output is fine as well). Thank you kindly in advance, Gregory Gelfond

    Read the article

  • Sentiment analysis for Twitter in Python

    - by Ran
    I'm looking for an open source implementation, preferably in Python, of textual sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). I'm writing an application that searches Twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets. I'm using Google's App Engine, so it's in Python. I'd like to be able to classify the returned search results from Twitter, and I'd like to do that in Python. I haven't been able to find such a sentiment analyzer so far, specifically not in Python. Is anyone familiar with an open source implementation I can use? Preferably it is already in Python, but if not, hopefully I can translate it to Python. Note, the texts I'm analyzing are VERY short; they are tweets. So ideally, this classifier is optimized for such short texts. BTW, Twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately the classification provided by them isn't that great, so I figured I might give this a try myself. Thanks! BTW, an early demo is here and the code I have so far is here, and I'd love to open-source it with any interested developer.
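
    If nothing readier turns up, a crude lexicon-plus-emoticon scorer is quick to write and tolerable on texts as short as tweets; the word lists below are placeholders that would need to be replaced by a real sentiment lexicon.

        POSITIVE = {"good", "great", "love", "awesome", "happy", ":)", ":-)", ":D"}
        NEGATIVE = {"bad", "terrible", "hate", "awful", "sad", ":(", ":-("}

        def classify_tweet(tweet):
            tokens = tweet.lower().split()
            score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
            if score > 0:
                return "happy"
            if score < 0:
                return "sad"
            return "neutral"

        print(classify_tweet("I love youtube :)"))          # happy
        print(classify_tweet("youtube is terrible today"))  # sad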

    Read the article

  • How do you parse a paragraph of text into sentences? (preferably in Ruby)

    - by henry74
    How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr. and Dr. and U.S.A.? (Assuming you just put the sentences into an array of arrays.) UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence. Getting data for the text "Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.", the classifier gives:

        Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

    and the POS tagger (POST) gives:

        Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

    Can we assume, since Italy is a location, that the period is a valid end of the sentence? Since ending on "Mr." would have no other parts of speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to my question? Thoughts?
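
    Not Ruby, but since the logic is short either way, here is a sketch (in Python) of the naive split-then-merge approach with an abbreviation whitelist; a trained sentence tokenizer such as NLTK's punkt, or a Ruby port of the same idea, handles these cases more robustly.

        import re

        # Non-exhaustive list of abbreviations whose trailing period is not a sentence boundary.
        ABBREVIATIONS = {"mr.", "mrs.", "dr.", "ms.", "st.", "e.g.", "i.e.", "u.s.a."}

        def split_sentences(text):
            parts = re.split(r"(?<=[.!?])\s+", text)  # naive split after ., ! or ?
            sentences = []
            for part in parts:
                if sentences and sentences[-1].split() and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
                    sentences[-1] += " " + part  # previous piece ended in an abbreviation: merge
                else:
                    sentences.append(part)
            return sentences

        print(split_sentences("Mr. Jones felt the warm sun. He was happy to be alive."))
        # ['Mr. Jones felt the warm sun.', 'He was happy to be alive.']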

    Read the article

  • Extracting pure content / text from HTML Pages by excluding navigation and chrome content

    - by Ankur Gupta
    Hi, I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate the web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for the same). Then I run a diff algorithm comparing various articles' text from the same website; this results in similar text being eliminated. This gives me the content minus the common navigation content etc. Despite the above approach I am still getting quite some junk in my final text, which results in the incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Can you suggest an alternative strategy for extraction of pure content? Would/can learning natural language processing help in extracting the correct abstract from these articles? How would you approach the above problem? Are there any research papers on the same? Regards, Ankur Gupta
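
    On the diff step specifically, in case a concrete form helps: a sketch with Python's difflib that drops runs of lines shared verbatim with another article from the same site, which is roughly the "subtract the template" idea described above. The threshold is a guess, and short shared chrome (like a lone "Contact us" line) will slip through it.

        import difflib

        def strip_shared_lines(article_lines, other_lines, min_run=3):
            # Remove runs of >= min_run lines that also appear, in order, in another
            # article from the same site; such runs are usually navigation, not news.
            matcher = difflib.SequenceMatcher(None, article_lines, other_lines)
            drop = set()
            for match in matcher.get_matching_blocks():
                if match.size >= min_run:
                    drop.update(range(match.a, match.a + match.size))
            return [line for i, line in enumerate(article_lines) if i not in drop]

        a = ["Home", "World", "Sports", "Quake hits region", "Rescue efforts continue.", "Contact us"]
        b = ["Home", "World", "Sports", "Elections announced", "Polls open Friday.", "Contact us"]
        print(strip_shared_lines(a, b))
        # ['Quake hits region', 'Rescue efforts continue.', 'Contact us']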

    Read the article

  • Another Porter stemming algorithm implementation question

    - by mike
    Hi, I am trying to implement the Porter stemming algorithm, but I am having difficulties understanding this point: Step 1c: (*v*) Y -> I, e.g. happy -> happi, sky -> sky. Isn't that the opposite of what we want to do? Why does the algorithm convert the Y into an I? The complete algorithm is here: http://tartarus.org/~martin/PorterStemmer/def.txt Thanks
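
    As I read the definition, the point of step 1c is that stems are not meant to be real words, only stable keys: turning the final y into i lets forms like "happy" and "happiness" collapse to the same stem ("happi") after the later suffix-stripping steps. A tiny sketch of just this step, with a deliberately crude vowel test:

        def has_vowel(stem):
            # crude check: does the stem contain at least one of a, e, i, o, u?
            return any(ch in "aeiou" for ch in stem)

        def step1c(word):
            # Porter step 1c: (*v*) Y -> I, i.e. if the stem before the final 'y'
            # contains a vowel, replace that 'y' with 'i'.
            if word.endswith("y") and has_vowel(word[:-1]):
                return word[:-1] + "i"
            return word

        print(step1c("happy"), step1c("sky"))  # happi sky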

    Read the article

  • How to determine the (natural) language of a document?

    - by Robert Petermeier
    I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is Java and not free for "semi-commercial" usage. This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad: for six web pages in German to which I pointed it, only one guess was correct; the other guesses were Swedish, English, Danish and French... A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language, the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
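
    For the stop word idea, a toy version is genuinely only a few lines; the lists below are tiny placeholders and real ones should be much larger (character n-gram profiles are the other common technique if this turns out not to be good enough).

        # Hypothetical, tiny stop word lists; real ones should be much larger.
        ENGLISH_STOPS = {"the", "and", "is", "of", "to", "in", "that", "it", "with"}
        GERMAN_STOPS = {"der", "die", "das", "und", "ist", "von", "zu", "nicht", "mit"}

        def guess_language(text):
            words = text.lower().split()
            english = sum(1 for w in words if w in ENGLISH_STOPS)
            german = sum(1 for w in words if w in GERMAN_STOPS)
            return "en" if english >= german else "de"

        print(guess_language("Das ist ein Test und nicht mehr"))  # de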

    Read the article

  • Latent Dirichlet Allocation, pitfalls, tips and programs

    - by Gregg Lind
    I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice. Which program is the "best", where best is some combination of easiest to use, best prior estimation, and fast? How do I incorporate my intuitions about topicality? Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis? Any unexpected pitfalls or tips I should know before embarking? I'd prefer it if there are R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.
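
    On the R/Python front-end question: gensim (Python) is one of the easier entry points, and its LdaModel exposes the alpha and eta Dirichlet priors, which is, as far as I know, the closest built-in hook for pushing prior beliefs about topics into the model. A minimal sketch, assuming gensim is installed and the documents are already tokenized:

        from gensim import corpora, models

        texts = [
            ["dirichlet", "allocation", "topic", "model"],
            ["python", "gensim", "topic", "model"],
            ["author", "article", "corpus", "category"],
        ]  # toy, pre-tokenized documents

        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]

        # num_topics, passes, alpha and eta are the knobs worth experimenting with;
        # alpha/eta are the Dirichlet priors mentioned above.
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

        for topic in lda.print_topics():
            print(topic)
        print(lda[dictionary.doc2bow(["topic", "model", "python"])])  # topic mixture of a new doc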

    Read the article

  • Efficient Context-Free Grammar parser, preferably Python-friendly

    - by Max Shawabkeh
    I am in need of parsing a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently. Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical entries have little to no effect on performance. Another problem is that loading the (25MB) grammar+lexicon at the beginning can take up to a minute. From what I can find in the literature, the running time of the algorithms used to parse such a grammar (Earley or CKY) should be linear in the size of the grammar and cubic in the size of the input token list. My experience with NLTK indicates that ambiguity is what hurts performance most, not the absolute size of the grammar. So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY, but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to be doing a lot of procedural parsing rather than just specifying a grammar. Can anybody show me an example of PLY both supporting feature structs and using a declarative grammar? I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

    Read the article

  • Calling the Stanford POS Tagger's MaxentTagger from a Java program

    - by Akansha
    Hi. I am new to the Stanford POS tagger. I need to call the tagger from my Java program and direct the output to a text file. I have extracted the source files from stanford-postagger and tried calling the MaxentTagger, but all I get is errors and warnings. Can somebody tell me, from scratch, how to call MaxentTagger in my program, how to set the classpath if required, and any other such steps? Please help me out.

    Read the article

  • How to make a concept representation with the help of a bag of words

    - by agazerboy
    Hi all, thanks for stopping to read my question :) This is a very sweet place, full of GREAT people! I have a question about "creating sentences with words". NO NO, it is not about English grammar :) Let me explain: if I have a bag of words like "person apple apple person person a eat person will apple eat hungry apple hungry", can it generate some kind of sentence such as "hungry person eat apple"? I don't know which field this topic belongs to or where I should try to find an answer. I tried searching Google but I only found English grammar stuff :) Is there anybody who can tell me which algorithm (or program) can work on this problem? Thanks. P.S.: It is not an assignment :) If it were, I would ask for source code! I don't even know which field I should look in :)

    Read the article
