Search Results

Search found 234 results on 10 pages for 'stanford nlp'.

Page 2/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page >

Natural language processing - Ideas for beginner's projects

- by Microkernel

Hi guys, I am a beginner in NLP and NLTK. I am very interested in NLP and hence joined a weekend course on AI in some local institution, which requires me to do a project for completion of the course, and I decided to do it in NLP. The problem is,the instructor is not good at all for this course (According to me she is just a charlatan) (or may not be very interested in teaching as this is her last batch here after which the institute is going to send her out). So I am stuck in a situation where where I got to finish this project in a month to one and half months period, but as a naive person in the field I am feeling it very difficult to comprehend the things required to decide on project. (Also, as I am working full time, I am not finding enough time to dedicate on this). I considered using NLTK toolkit in python for the project for following reasons. (1) Python is famous for ease of use, rapid prototyping and very active community (considering very short span of time I have, and as I am a C programmer by profession, I need a language that I can learn fast and is simple to use). (2) NLTk has good review, and extensive documentation and a very active community. So the problem is what project should I take up, so that I can learn something and will be able to finish project in time. (I know almost nothing in NLP, don't even know what exactly corpora is... :( ) So, please suggest me some topics that I should consider for the project. Regards, MicroKernel :)

Read the article
Sentence Tree v/s Words List

- by Rohit Jose

I was recently tasked with building a Name Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities. One approach that was suggested was to keep a lookup table for all the know connector words like articles and conjunctions, remove them from the words list after splitting the sentence on the basis of the spaces. This would leave out the Name Entities in the sentence. A lookup is then done for these identified entities on another lookup table that associates them to the entity type, for example if the sentence was: Remember the Titans was a movie directed by Boaz Yakin, the possible outputs would be: {Remember the Titans,Movie} was {a movie,Movie} directed by {Boaz Yakin,director} {Remember the Titans,Movie} was a movie directed by Boaz Yakin {Remember the Titans,Movie} was {a movie,Movie} directed by Boaz Yakin {Remember the Titans,Movie} was a movie directed by {Boaz Yakin,director} Remember the Titans was {a movie,Movie} directed by Boaz Yakin Remember the Titans was {a movie,Movie} directed by {Boaz Yakin,director} Remember the Titans was a movie directed by {Boaz Yakin,director} Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by {Boaz Yakin,director} Remember the {the titans,Movie,Sports Team} was a movie directed by Boaz Yakin Remember the {the titans,Movie,Sports Team} was {a movie,Movie} directed by Boaz Yakin Remember the {the titans,Movie,Sports Team} was a movie directed by {Boaz Yakin,director} The entity lookup table here would contain the following data: Remember the Titans=Movie a movie=Movie Boaz Yakin=director the Titans=Movie the Titans=Sports Team Another alternative logic that was put forward was to build a crude sentence tree that would contain the connector words in the lookup table as parent nodes and do a lookup in the entity table for the leaf node that might contain the entities. The tree that was built for the sentence above would be: The question I am faced with is the benefits of the two approaches, should I be going for the tree approach to represent the sentence parsing, since it provides a more semantic structure? Is there a better approach I should be going for solving it?

Read the article
What algorithm(s) can be used to achieve reasonably good next word prediction?

- by yati sagade

What is a good way of implementing "next-word prediction"? For example, the user types "I am" and the system suggests "a" and "not" (or possibly others) as the next word. I am aware of a method that uses Markov Chains and some training text(obviously) to more or less achieve this. But I read somewhere that this method is very restrictive and applies to very simple cases. I understand basics of neural networks and genetic algorithms(though have never used them in a serious project) and maybe they could be of some help. I wonder if there are any algorithms that, given appropriate training text(e.g., newspaper articles, and the user's own typing) can come up with reasonably appropriate suggestions for the next word. If not (links to)algorithms, general high-level methods to attack this problem are welcome.

Read the article
Stanford iPhone dev Paparazzi 1 setup help

- by cksubs

Hi, I'm having trouble using a tab bar control. Basically, when I build and run I'm just getting the blank white default "Window" screen. I was following this guide: http://www.iphoneosdevcafe.com/2010/03/assignment-4-part-1/#comment-36 Drag a Tab Bar Controller from the Library to MainWindow.xib. Control drag from the App Delegate to the Tab Bar Controller. Drag a Navigation Controller from the Library to MainWindow.xib. Control drag from the App Delegate to the Navigation Controller. Drag a second Navigation Controller from the Library to MainWindow.xib. Control drag from the App Delegate to the Navigation Controller. This completes all the connections between the App Delegate and the tab bar and two navigation controllers. By using IB to set up the tab bar and navigation controllers in this way, you need not allocate and init the controllers in AppDelegate.m. When you build and run this, you will see the tab bar controller and two navigation controllers. Is there a step there that he missed? How do I hook the Tab Bar Controller up to the Window? EDIT: Do I need to do something like this? mainTabBar = [[UITabBarController alloc] init]; [window addSubview:mainTabBar.view]; That's still not working, but I feel like I'm on the right track? Why can't this all be done from Interface Builder?

Read the article
Error when compiling code with the Protege API

- by Anto

I am new to Protege API and I have just created on Eclipse a small application which uses an external OWL file. Also I did import all the necessary libraries. import java.util.Collection; import java.util.Iterator; import edu.stanford.smi.protege.exception.OntologyLoadException; import edu.stanford.smi.protegex.owl.ProtegeOWL; import edu.stanford.smi.protegex.owl.model.*; public class Trial { public static void main(String[] args) throws OntologyLoadException{ String uri = "C:/Documents and Settings/Anto/Desktop/travel.owl"; OWLModel owlModel = ProtegeOWL.createJenaOWLModelFromURI(uri); Collection classes = owlModel.getUserDefinedOWLNamedClasses(); for(Iterator it = classes.iterator(); it.hasNext();){ OWLNamedClass cls = (OWLNamedClass) it.next(); Collection instances = cls.getInstances(false); System.out.println("Class " + cls.getBrowserText()+ " (" + instances.size()+")"); for(Iterator jt = instances.iterator(); jt.hasNext();){ OWLIndividual individual = (OWLIndividual) jt.next(); System.out.println(" - "+ individual.getBrowserText()); } } } } When I do compile however I get the following errors: WARNING: [Local Folder Repository] The specified file must be a directory. (C:\Documents and Settings\Anto\My Documents\Eclipse Workspace\ProtegeTrial\plugins\edu.stanford.smi.protegex.owl) LocalFolderRepository.update() SEVERE: Exception caught -- java.net.URISyntaxException: Illegal character in path at index 12: C:/Documents and Settings/CiuffreA/Desktop/travel.owl at java.net.URI$Parser.fail(URI.java:2809) at java.net.URI$Parser.checkChars(URI.java:2982) at java.net.URI$Parser.parseHierarchical(URI.java:3066) at java.net.URI$Parser.parse(URI.java:3014) at java.net.URI.<init>(URI.java:578) at edu.stanford.smi.protegex.owl.jena.JenaKnowledgeBaseFactory.getFileURI(Unknown Source) at edu.stanford.smi.protegex.owl.jena.JenaKnowledgeBaseFactory.loadKnowledgeBase(Unknown Source) at edu.stanford.smi.protege.model.Project.loadDomainKB(Unknown Source) at edu.stanford.smi.protege.model.Project.createDomainKnowledgeBase(Unknown Source) at edu.stanford.smi.protegex.owl.jena.creator.OwlProjectFromUriCreator.create(Unknown Source) at edu.stanford.smi.protegex.owl.ProtegeOWL.createJenaOWLModelFromURI(Unknown Source) at Trial.main(Trial.java:14) Exception in thread "main" java.lang.NullPointerException at edu.stanford.smi.protegex.owl.jena.JenaKnowledgeBaseFactory.loadKnowledgeBase(Unknown Source) at edu.stanford.smi.protege.model.Project.loadDomainKB(Unknown Source) at edu.stanford.smi.protege.model.Project.createDomainKnowledgeBase(Unknown Source) at edu.stanford.smi.protegex.owl.jena.creator.OwlProjectFromUriCreator.create(Unknown Source) at edu.stanford.smi.protegex.owl.ProtegeOWL.createJenaOWLModelFromURI(Unknown Source) at Trial.main(Trial.java:14) Does anyone have an idea on where the problem should be?

Read the article
SOLR and Natural Language Parsing - Can I use it?

- by andy

hey guys, my requirements are pretty similar to this: Requirements http://stackoverflow.com/questions/90580/word-frequency-algorithm-for-natural-language-processing Using Solr While the answer for that question is excellent, I was wondering if I could make use of all the time I spent getting to know SOLR for my NLP. I thought of SOLR because: It's got a bunch of tokenizers and performs a lot of NLP. It's pretty use to use out of the box. It's restful distributed app, so it's easy to hook up I've spent some time with it, so using could save me time. Can I use Solr? Although the above reasons are good, I don't know SOLR THAT well, so I need to know if it would be appropriate for my requirements. Ideal Usage Ideally, I'd like to configure SOLR, and then be able to send SOLR some text, and retrieve the indexed tonkenized content. Context So you guys know, I'm working on a small component of a bigger recommendation engine.

Read the article
Online Introduction to Relational Databases (and not only) with Stanford University!

- by Luca Zavarella

How many of you know exactly the definition of "relational database"? What exactly the adjective "relational" refers to? Many of you allow themselves to be deceived, thinking this adjective is related to foreign key constraints between tables. Instead this adjective lurks in a world based on set theory, relational algebra and the concept of relationship intended as a table.Well, for those who want to deep the fundamentals of relational model, relational algebra, XML, OLAP and emerging "NoSQL" systems, Stanford University School of Engineering offers a public and free online introductory course to databases. This is the related web page: http://www.db-class.com/ The course will last 2 months, after which there will be a final exam. Passing the final exam will entitle the participants to receive a statement of accomplishment. A syllabus and more information is available here. Happy eLearning to you!

Read the article
Algorithm for Negating Sentences

- by Kevin Dolan

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation. For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad". Obviously, accomplishing this with a high degree of accuracy would probably be beyond the scope of current NLP, but I'm sure there has been some work on the subject. If anybody knows of any work, care to point me to some papers?

Read the article
how to programming SYSTEM for reading_comperhension question in English.

- by michael123

hi , i have to do some study for reading_comperhension in English. my work is ok but there is part from natural language - nlp area that i have to used . i want some help about QAsystem , how to answer the reading_comperhension automaticly .i have simple system about it that i get from this website http://www.cs.utah.edu/contest/2003/ , there is simple system in java but did not work to my i try to load the file from Remedia dataset that have the reading_comperhension story, but no result. after run this system , i have to develop by current technique such as rule based or pattern machine or combined with simple named entity.how to make that and which of them is petter to combined with the QAsys. thank you

Read the article
Building dictionary of words from large text

- by LiorH

I have a text file containing posts in English/Italian. I would like to read the posts into a data matrix so that each row represents a post and each column a word. The cells in the matrix are the counts of how many times each word appears in the post. The dictionary should consist of all the words in the whole file or a non exhaustive English/Italian dictionary. I know this is a common essential preprocessing step for NLP. Does anyone know of a tool\project that can perform this task? Someone mentioned apache lucene, do you know if lucene index can be serialized to a data-structure similar to my needs?

Read the article
How to get logical parts of a sentence with java?

- by roddik

Hello. Let's say there is a sentence: On March 1, he was born. Changing it to He was born on March 1. doesn't break the sense of the sentence and it is still valid. Shuffling words in any other way would produce weird to invalid sentences. So basically, I'm talking about parts of the sentence, which make the information more specific, but removing them doesn't break the whole sentence. Is there any NLP library in which identifying such parts is available?

Read the article
Java or Python distributed compute job (on a student budget)?

- by midget_sadhu

I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which i do not have root access, and only 1G of user space. I experimented with hadoop, but of course this was dead in the water-- the data is stored on an external usb hard drive, and i cant load it on to the dfs because of the 1G user space cap. I have been looking into a couple of python based options (as I'd rather use NLTK instead of Java's lingpipe if I can help it), and it seems distributed compute options look like: Ipython DISCO After my hadoop experience, i am trying to make sure i try and make an informed choice -- any help on what might be more appropriate would be greatly appreciated. Amazon's EC2 etc not really an option, as i have next to no budget.

Read the article
Word list sources

- by warren

I am looking for a source of nouns, adverbs, adjectives, and verbs in several languages. I'd like the lists to already be split apart, and not have to go through the OED (and non-English equivalents) by hand re-creating said lists. I don't really care about definitions, and I understand some words can be multiple parts of speech - that's fine - words like "many" could be a noun or adjective, and can appear in both lists. Does anyone here know of such a source? If not, might someone be able to point me in the right direction?

Read the article
Which is better? OpenCyc or ConceptNet?

- by Daniel Loureiro

Hi, I'm doing a NLP project where I need to recognise concepts in sentences to find other similar concepts. I do this to infer word valences from a list I already have. I started using WordNet, but it gave many contradictory results. By contradictory results I mean word expansions that had contradictory valences. So now I'm looking into ConceptNet and OpenCyc. I've already implemented ConceptNet and it was all very easy and I love it. Problem is that OpenCyc appears to have a much larger and more logically rigid database, which is important when I found so many "contradictions" on WordNet... But I wouldn't know because I haven't tried it. Could someone tell me if it's worth going through the (considerable, for me) effort to implement OpenCyc, or is ConceptNet good enough to infer word valences? Are they that different? I'll be happy to explain myself further, if needed. Trying to keep it short for now! Thanks!

Read the article
Indexing and Searching Over Word Level Annotation Layers in Lucene

- by dmcer

I have a data set with multiple layers of annotation over the underlying text, such as part-of-tags, chunks from a shallow parser, name entities, and others from various natural language processing (NLP) tools. For a sentence like The man went to the store, the annotations might look like: Word POS Chunk NER ==== === ===== ======== The DT NP Person man NN NP Person went VBD VP - to TO PP - the DT NP Location store NN NP Location I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows: Query: Word=Washington,NER=Person I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person followed by the words arrived at followed by a word tagged location. Such a query might look like: Query: "NER=Person Word=arrived Word=at NER=Location" What's a good way to go about approaching this with Lucene? Is there anyway to index and search over document fields that contain structured tokens?

Read the article
getting text that will be displayed to user from html

- by gordatron

Bit of a random one, i am wanting to have a play with some NLP stuff and I would like to: Get all the text that will be displayed to the user in a browser from HTML. My ideal output would not have any tags in it and would only have fullstops (and any other punctuation used) and new line characters, though i can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in output). If there was a way of inserting a newline or full stop in situations where the content was likely not to continue on then that would be considered an added bonus. e.g: items in an ul or option tag could be separated by full stops (or to be honest just ignored). I am working Java, but would be interested in seeing any code that does this. I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-). An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span or strong etc tags, and add a full stop if I hit a div or another p without having had a fullstop. Any pointers or suggestions very welcome.

Read the article
How to pass a function in a function?

- by SoulBeaver

That's an odd title. I would greatly appreciate it if somebody could clarify what exactly I'm asking because I'm not so sure myself. I'm watching the Stanford videos on Programming Paradigms(that teacher is awesome) and I'm up to video five when he started doing this: void *lSearch( void* key, void* base, int elemSize, int n, int (*cmpFn)(void*, void*)) Naturally, I thought to myself, "Oi, I didn't know you could declare a function and define it later!". So I created my own C++ test version. int foo(int (*bar)(void*, void*)); int bar(void* a, void* b); int main(int argc, char** argv) { int *func = 0; foo(bar); cin.get(); return 0; } int foo(int (*bar)(void*, void*)) { int c(10), d(15); int *a = &c; int *b = &d; bar(a, b); return 0; } int bar(void* a, void* b) { cout << "Why hello there." << endl; return 0; } The question about the code is this: it fails if I declare function int *bar as a parameter of foo, but not int (*bar). Why!? Also, the video confuses me in the fact that his lSearch definition void* lSearch( /*params*/ , int (*cmpFn)(void*, void*)) is calling cmpFn in the definition, but when calling the lSearch function lSearch( /*params*/, intCmp ); also calls the defined function int intCmp(void* elem1, void* elem2); and I don't get how that works. Why, in lSearch, is the function called cmpFn, but defined as intCmp, which is of type int, not int* and still works? And why does the function in lSearch not have to have defined parameters?

Read the article
Sentiment analysis with NLTK python for sentences using sample data or webservice?

- by Ke

I am embarking upon a NLP project for sentiment analysis. I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice) I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??) Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron" Then I would like to check for positive/negative sentiment in each sentence and count them accordingly NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm. Here are the troubles I am having: All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?) I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary? I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link? Any help is much appreciated and will save me much hair! Cheers Ke

Read the article
understanding semcor corpus structure h

- by Sharmila

I'm learning NLP. I currently playing with Word Sense Disambiguation. I'm planning to use the semcor corpus as training data but I have trouble understanding the xml structure. I tried googling but did not get any resource describing the content structure of semcor. <s snum="1"> <wf cmd="ignore" pos="DT">The</wf> <wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf> <wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf> <wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf> <wf cmd="ignore" pos="DT">an</wf> <wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf> <wf cmd="ignore" pos="IN">of</wf> <wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf> <wf cmd="ignore" pos="POS">'s</wf> <wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf> <wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf> <wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf> <punc>``</punc> <wf cmd="ignore" pos="DT">no</wf> <wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf> <punc>''</punc> <wf cmd="ignore" pos="IN">that</wf> <wf cmd="ignore" pos="DT">any</wf> <wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf> <wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf> <punc>.</punc> </s> I'm assuming wnsn is 'word sense'. Is it correct? What does the attribute lexsn mean? How does it map to wordnet? What does the attribute pn refer to? (third line) How is the rdf attribute assigned? (again third line) In general, what are the possible attributes?

Read the article
Text mining with PHP

- by garyc40

Hi, I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classify tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is Naive Bayes classifier or decision tree. However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP? I'm planning to use WEKA as the back end of the web app (by calling Weka in command line from within PHP), but it doesn't seem that efficient. Do you have any idea what I should use for this project? Or should I just switch to Python? Thanks

Read the article
Python - pyparsing unicode characters

- by mgj

Hi..:) I tried using w = Word(printables), but it isn't working. How should I give the spec for this. 'w' is meant to process Hindi characters (UTF-8) The code specifies the grammar and parses accordingly. 671.assess :: ????? ::2 x=number + "." + src + "::" + w + "::" + number + "." + number If there is only english characters it is working so the code is correct for the ascii format but the code is not working for the unicode format. I mean that the code works when we have something of the form 671.assess :: ahsaas ::2 i.e. it parses words in the english format, but I am not sure how to parse and then print characters in the unicode format. I need this for English Hindi word alignment for purpose. The python code looks like this: # -*- coding: utf-8 -*- from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables , Group , alphas8bit , # grammar src = Word(printables) trans = Word(printables) number = Word(nums) x=number + "." + src + "::" + trans + "::" + number + "." + number #parsing for eng-dict efiledata = open('b1aop_or_not_word.txt').read() eresults = x.parseString(efiledata) edict1 = {} edict2 = {} counter=0 xx=list() for result in eresults: trans=""#translation string ew=""#english word xx=result[0] ew=xx[2] trans=xx[4] edict1 = { ew:trans } edict2.update(edict1) print len(edict2) #no of entries in the english dictionary print "edict2 has been created" print "english dictionary" , edict2 #parsing for hin-dict hfiledata = open('b1aop_or_not_word.txt').read() hresults = x.scanString(hfiledata) hdict1 = {} hdict2 = {} counter=0 for result in hresults: trans=""#translation string hw=""#hin word xx=result[0] hw=xx[2] trans=xx[4] #print trans hdict1 = { trans:hw } hdict2.update(hdict1) print len(hdict2) #no of entries in the hindi dictionary print"hdict2 has been created" print "hindi dictionary" , hdict2 ''' ####################################################################################################################### def translate(d, ow, hinlist): if ow in d.keys():#ow=old word d=dict print ow , "exists in the dictionary keys" transes = d[ow] transes = transes.split() print "possible transes for" , ow , " = ", transes for word in transes: if word in hinlist: print "trans for" , ow , " = ", word return word return None else: print ow , "absent" return None f = open('bidir','w') #lines = ["'\ #5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \ #5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \ #'"] data=open('bi_full_2','rb').read() lines = data.split('!@#$%') loc=0 for line in lines: eng, hin = [subline.split(' # ') for subline in line.strip('\n').split('\n')] for transdict, source, dest in [(edict2, eng, hin), (hdict2, hin, eng)]: sourcethings = source[2].split() for word in source[1].split(): tl = dest[1].split() otherword = translate(transdict, word, tl) loc = source[1].split().index(word) if otherword is not None: otherword = otherword.strip() print word, ' <-> ', otherword, 'meaning=good' if otherword in dest[1].split(): print word, ' <-> ', otherword, 'trans=good' sourcethings[loc] = str( dest[1].split().index(otherword) + 1) source[2] = ' '.join(sourcethings) eng = ' # '.join(eng) hin = ' # '.join(hin) f.write(eng+'\n'+hin+'\n\n\n') f.close() ''' if an example input sentence for the source file is: 1# 5 # modern markets : confident consumers # 0 0 0 0 0 1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0 !@#$% the ouptut would look like this :- 1# 5 # modern markets : confident consumers # 1 2 3 4 5 1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0 !@#$% Output Explanation:- This achieves bidirectional alignment. It means the first word of english 'modern' maps to the first word of hindi 'AddhUnIk' and vice versa. Here even characters are take as words as they also are an integral part of bidirectional mapping. Thus if you observe the hindi WORD '.' has a null alignment and it maps to nothing with respect to the English sentence as it doesn't have a full stop. The 3rd line int the output basically represents a delimiter when we are working for a number of sentences for which your trying to achieve bidirectional mapping. What modification should i make for it to work if the I have the hindi sentences in Unicode(UTF-8) format.

Read the article
Algorithm for analyzing text of words

- by Click Upvote

I want an algorithm which would create all possible phrases in a block of text. For example, in the text: "My username is click upvote. I have 4k rep on stackoverflow" It would create the following combinations: "My username" "My Username is" "username is click" "is click" "is click upvote" "click upvote" "i have" "i have 4k" "have 4k" .. You get the idea. Basically the point is to get all possible combinations of 'phrases' out of a sentence. Any thoughts for how to best implement this?

Read the article
Looking for a good semantic parser for the Russian language.

- by Gregory Gelfond

Does anyone known of a semantic parser for the Russian language? I've attempted to configure the link-parser available from link-grammar site but to no avail. I'm hoping for a system that can run on the Mac and generate either a prolog or lisp-like representation of the parse tree (but XML output is fine as well). Thank you kindly in advance, Gregory Gelfond

Read the article
Sentiment analysis for twitter in python

- by Ran

I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source implementation I can use? I'm writing an application that searches twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets. I'm using Google's appengine, so it's in python. I'd like to be able to classify the returned search results from twitter and I'd like to do that in python. I haven't been able to find such sentiment analyzer so far, specifically not in python. Are you familiar with such open source implementation I can use? Preferably this is already in python, but if not, hopefully I can translate it to python. Note, the texts I'm analyzing are VERY short, they are tweets. So ideally, this classifier is optimized for such short texts. BTW, twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately, the classification provided by them isn't that great, so I figured I might give this a try myself. Thanks! BTW, an early demo is here and the code I have so far is here and I'd love to opensource it with any interested developer.

Read the article
Efficient Context-Free Grammar parser, preferably Python-friendly

- by Max Shawabkeh

I am in need of parsing a small subset of English for one of my project, described as a context-free grammar with (1-level) feature structures (example) and I need to do it efficiently . Right now I'm using NLTK's parser which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending it seems on the number of resulting trees. Lexical entries have little to no effect on performance. Another problem is that loading the (25MB) grammar+lexicon at the beginning can take up to a minute. From what I can find in literature, the running time of the algorithm used to parse such a grammar (Earley or CKY) should be linear to the size of the grammar and cubic to the size of the input token list. My experience with NLTK indicates that ambiguity is what hurts the performance most, not the absolute size of the grammar. So now I'm looking for a CFG parser to replace NLTK. I've been considering PLY but I can't tell whether it supports feature structures in CFGs, which are required in my case, and the examples I've seen seem to be doing a lot of procedural parsing rather than just specifying a grammar. Can anybody show me an example of PLY both supporting feature structs and using a declarative grammar? I'm also fine with any other parser that can do what I need efficiently. A Python interface is preferable but not absolutely necessary.

Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page >