Search Results

Search found 85 results on 4 pages for 'classifier'.

Page 1/4 | 1 2 3 4 | Next Page >

Issue in understanding how to compare performance of classifier using ROC

- by user1214586

I am trying to demystify pattern recognition techniques and understood few of them. I am trying to design a classifier M. A gesture is classified based on the hamming distance between the sample time series y and the training time series x. The result of the classifier are probabilistic values. There are 3 classes/categories with labels A,B,C which classifies hand gestures where there are 100 samples for each class which are to be classified (single feature and data length=100). The data are different time series (x coordinate vs time). The training set is used to assign probabilities indicating which gesture has occured how many times. So,out of 10 training samples if gesture A appeared 6 times then probability that a gesture falls under category A is P(A)=0.6 similarly P(B)=0.3 and P(C)=0.1 Now, I am trying to compare the performance of this classifier with Bayes classifier, K-NN, Principal component analysis (PCA) and Neural Network. On what basis,parameter and method should I do it if I consider ROC or cross validate since the features for my classifier are the probabilistic values for the ROC plot hence what shall be the features for k-nn,bayes classification and PCA? Is there a code for it which will be useful. What should be the value of k is there are 3 classes of gestures? Please help. I am in a fix.

Read the article
Discrete and Continuous Classifier on Sparse Data

- by Chris S

I'm trying to classify an example, which contains discrete and continuous features. Also, the example represents sparse data, so even though the system may have been trained on 100 features, the example may only have 12. What would be the best classifier algorithm to use to accomplish this? I've been looking at Bayes, Maxent, Decision Tree, and KNN, but I'm not sure any fit the bill exactly. The biggest sticking point I've found is that most implementations don't support sparse data sets and both discrete and continuous features. Can anyone recommend an algorithm and implementation (preferably in Python) that fits these criteria? Libraries I've looked at so far include: Orange (Mostly academic. Implementations not terribly efficient or practical.) NLTK (Also academic, although has a good Maxent implementation, but doesn't handle continuous features.) Weka (Still researching this. Seems to support a broad range of algorithms, but has poor documentation, so it's unclear what each implementation supports.)

Read the article
Mahout Naive Bayes Classifier for Items

- by Nimesh Parikh

Team, I am working on a project where i need to classify Items into certain category. I have a single file as input; which contains target variable and space separated features. My training data will look like Category Name [Tab] DataString Plumbing [Tab] Pipe Tap Plastic Pipe PVC Pipe Cold Water Line Hot Water Line Tee outlet up Elbow turned up Elbow turned down Gate valve Globe valve Paint [Tab] Ivory Black Burnt Umber Caput Mortuum Violet Earth Red Yellow Ochre Titanium White Cadmium Yellow Light Cadmium Yellow Deep Cloths [Tab] Shirt T-Shirt Pent Jeans Tee Cargo Well, I have really big set of Category. I have couple of question here am i using correct data for Training? If no then what should i use? Once I train and Test my model, what is next step? How can i use output? Please help me with this Thanks, Nimesh

Read the article
How do I classify using SVM Classifier?

- by Gomathi

I'm doing a project in liver tumor classification. Actually I initially used Region Growing method for liver segmentation and from that I segmented tumor using FCM. I,then, obtained the texture features using Gray Level Co-occurence Matrix. My output for that was stats = autoc: [1.857855266614132e+000 1.857955341199538e+000] contr: [5.103143332457753e-002 5.030548650257343e-002] corrm: [9.512661919561399e-001 9.519459060378332e-001] corrp: [9.512661919561385e-001 9.519459060378338e-001] cprom: [7.885631654779597e+001 7.905268525471267e+001] Now how should I give this as an input to the SVM program. function [itr] = multisvm( T,C,tst ) %MULTISVM(2.0) classifies the class of given training vector according to the % given group and gives us result that which class it belongs. % We have also to input the testing matrix %Inputs: T=Training Matrix, C=Group, tst=Testing matrix %Outputs: itr=Resultant class(Group,USE ROW VECTOR MATRIX) to which tst set belongs %----------------------------------------------------------------------% % IMPORTANT: DON'T USE THIS PROGRAM FOR CLASS LESS THAN 3, % % OTHERWISE USE svmtrain,svmclassify DIRECTLY or % % add an else condition also for that case in this program. % % Modify required data to use Kernel Functions and Plot also% %----------------------------------------------------------------------% % Date:11-08-2011(DD-MM-YYYY) % % This function for multiclass Support Vector Machine is written by % ANAND MISHRA (Machine Vision Lab. CEERI, Pilani, India) % and this is free to use. email: [email protected] % Updated version 2.0 Date:14-10-2011(DD-MM-YYYY) u=unique(C); N=length(u); c4=[]; c3=[]; j=1; k=1; if(N>2) itr=1; classes=0; cond=max(C)-min(C); while((classes~=1)&&(itr<=length(u))&& size(C,2)>1 && cond>0) %This while loop is the multiclass SVM Trick c1=(C==u(itr)); newClass=c1; svmStruct = svmtrain(T,newClass); classes = svmclassify(svmStruct,tst); % This is the loop for Reduction of Training Set for i=1:size(newClass,2) if newClass(1,i)==0; c3(k,:)=T(i,:); k=k+1; end end T=c3; c3=[]; k=1; % This is the loop for reduction of group for i=1:size(newClass,2) if newClass(1,i)==0; c4(1,j)=C(1,i); j=j+1; end end C=c4; c4=[]; j=1; cond=max(C)-min(C); % Condition for avoiding group %to contain similar type of values %and the reduce them to process % This condition can select the particular value of iteration % base on classes if classes~=1 itr=itr+1; end end end end Kindly guide me. Images:

Read the article
When to choose which machine learning classifier?

- by LM

Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.) How do I know which classifier I should use? (Decision tree, SVM, Bayesian, logistic regression, etc.) In which cases is one of them the "natural" first choice, and what are the principles for choosing that one? Examples of the type of answers I'm looking for (from Manning et al.'s "Introduction to Information Retrieval book": http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html): a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes). [I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.] b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability. What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though. Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)? [Not sure if stackoverflow is the best place to ask this question, since it's more machine learning than actual programming -- if not, any suggestions for where else?]

Read the article
Visualize Classifier Error Weka

- by user1780592

Hye there i have a have datasets where this data i have test it on weka with J48 classifier It give me an output = 87.2611% Total of instances = 157 Correctly Instances = 137 Incorrectly instance = 20 Then i have do a visualize classifier error on my data. However my result have been decrease to: New result = 85.4015% Correctly Instances = 117 Incorrectly instances = 20 Total of instances = 137 Is there any reason for that? Should my result become much better after i do the visualize classifier error?

Read the article
Loading a PyML multiclass classifier... why isn't this working?

- by Michael Aaron Safyan

This is a followup from "Save PyML.classifiers.multi.OneAgainstRest(SVM()) object?". I am using PyML for a computer vision project (pyimgattr), and have been having trouble storing/loading a multiclass classifier. When attempting to load one of the SVMs in a composite classifier, with loadSVM, I am getting: ValueError: invalid literal for float(): rest Note that this does not happen with the first classifier that I load, only with the second. What is causing this error, and what can I do to get around this so that I can properly load the classifier? Details To better understand the trouble I'm running into, you may want to look at pyimgattr.py (currently revision 11). I am invoking the program with "./pyimgattr.py train" which trains the classifier (invokes train on line 571, which trains the classifier with trainmulticlassclassifier on line 490 and saves it with storemulticlassclassifier on line 529), and then invoking the program with "./pyimgattr.py test" which loads the classifier in order to test it with the testing dataset (invokes test on line 628, which invokes loadmulticlassclassifier on line 549). The multiclass classifier consists of several one-against-rest SVMs which are saved individually. The loadmulticlassclassifier function loads these individually by calling loadSVM() on several different files. It is in this call to loadSVM (done indirectly in loadclassifier on line 517) that I get an error. The first of the one-against-rest classifiers loads successfully, but the second one does not. A transcript is as follows: $ ./pyimgattr.py test [INFO] pyimgattr -- Loading attributes from "classifiers/attributes.lst"... [INFO] pyimgattr -- Loading classnames from "classifiers/classnames.lst"... [INFO] pyimgattr -- Loading dataset "attribute_data/apascal_test.txt"... [INFO] pyimgattr -- Loaded dataset "attribute_data/apascal_test.txt". [INFO] pyimgattr -- Loading multiclass classifier from "classifiers/classnames_from_attributes"... [INFO] pyimgattr -- Constructing object into which to store loaded data... [INFO] pyimgattr -- Loading manifest data... [INFO] pyimgattr -- Loading classifier from "classifiers/classnames_from_attributes/aeroplane.svm".... scanned 100 patterns scanned 200 patterns read 100 patterns read 200 patterns {'50': 38, '60': 45, '61': 46, '62': 47, '49': 37, '52': 39, '53': 40, '24': 16, '25': 17, '26': 18, '27': 19, '20': 12, '21': 13, '22': 14, '23': 15, '46': 34, '47': 35, '28': 20, '29': 21, '40': 32, '41': 33, '1': 1, '0': 0, '3': 3, '2': 2, '5': 5, '4': 4, '7': 7, '6': 6, '8': 8, '58': 44, '39': 31, '38': 30, '15': 9, '48': 36, '16': 10, '19': 11, '32': 24, '31': 23, '30': 22, '37': 29, '36': 28, '35': 27, '34': 26, '33': 25, '55': 42, '54': 41, '57': 43} read 250 patterns in LinearSparseSVModel done LinearSparseSVModel constructed model [INFO] pyimgattr -- Loaded classifier from "classifiers/classnames_from_attributes/aeroplane.svm". [INFO] pyimgattr -- Loading classifier from "classifiers/classnames_from_attributes/bicycle.svm".... label at None delimiter , Traceback (most recent call last): File "./pyimgattr.py", line 797, in sys.exit(main(sys.argv)); File "./pyimgattr.py", line 782, in main return test(attributes_file,classnames_file,testing_annotations_file,testing_dataset_path,classifiers_path,logger); File "./pyimgattr.py", line 635, in test multiclass_classnames_from_attributes_classifier = loadmulticlassclassifier(classnames_from_attributes_folder,logger); File "./pyimgattr.py", line 529, in loadmulticlassclassifier classifiers.append(loadclassifier(os.path.join(filename,label+".svm"),logger)); File "./pyimgattr.py", line 502, in loadclassifier result=loadSVM(filename,datasetClass = SparseDataSet); File "/Library/Python/2.6/site-packages/PyML/classifiers/svm.py", line 328, in loadSVM data = datasetClass(fileName, **args) File "/Library/Python/2.6/site-packages/PyML/containers/vectorDatasets.py", line 224, in __init__ BaseVectorDataSet.__init__(self, arg, **args) File "/Library/Python/2.6/site-packages/PyML/containers/baseDatasets.py", line 214, in __init__ self.constructFromFile(arg, **args) File "/Library/Python/2.6/site-packages/PyML/containers/baseDatasets.py", line 243, in constructFromFile for x in parser : File "/Library/Python/2.6/site-packages/PyML/containers/parsers.py", line 426, in next x = [float(token) for token in tokens[self._first:self._last]] ValueError: invalid literal for float(): rest

Read the article
Maven Assembly: include a dependency with a different classifier

- by James Kingsbery

I would like to build two different versions of a WAR in Maven (I know that's a no-no, that's just the way it is given the current situation). In the version of a WAR depicted by an assembly, I want to replace a dependency by the same dependency with a different classifier. For example, I was expecting this assembly to work: <assembly> <id>end-user</id> <formats> <format>war</format> </formats> <dependencySets> <dependencySet> <excludes> <exclude>group:artifact:jar:${project.version}</exclude> </excludes> <includes> <include>group:artifact:jar:${project.version}:end-user</include> </includes> </dependencySet> </dependencySets> </assembly> This doesn't work, but am I heading in the right direction? I've already read all the pages on the Maven assembly page and the section on the Maven Definitive Guide that seems relevant. Any pointers would be helpful.

Read the article
SQL SERVER – Fix: Error: 10920 Cannot drop user-defined function. It is being used as a resource governor classifier

- by pinaldave

If you have not read my SQL SERVER – Simple Example to Configure Resource Governor – Introduction to Resource Governor yesterday’s detailed primer on Resource Governor, I suggest you go ahead and read it before continuing this article. After reading the article the very first email I received was as follows: “Pinal, I configured resource governor on my development server and it worked fine with tests I ran. After doing some tests, I decided to remove the resource governor and as a first step I disabled it however, I was not able to drop the classification function during the process of the clean up. It was continuously giving me following error. Msg 10920, Level 16, State 1, Line 1 Cannot drop user-defined function myudfname. It is being used as a resource governor classifier. Would you please give me solution?” The original email was really this short and there is no other information. I am glad he has done experiments on development server and not on the production server. Production server must not be the playground of the experiments. I think I have covered the answer of this error in an earlier blog post. If the user disables the Resource Governor it is still not possible to drop the function because it can be enabled again and when enabled it can still use the same function. Here is the simple resolution of the how one can drop the classifier function (do this only if you are not going to use the function). The reason the classifier function can’t be dropped because it is associated with resource governor. Create a new classified function for your resource governor or just assign NULL as described in the following T-SQL Script and you will be able to drop the function without error. ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = NULL) GO ALTER RESOURCE GOVERNOR DISABLE GO DROP FUNCTION dbo.UDFClassifier GO I am glad that user asked me question instead of doing something radically different, which can leave the server in the unusable state. I am aware of this only method to avoid this error. Is there any better way to achieve the same? Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: PostADay, SQL, SQL Authority, SQL Error Messages, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

Read the article
Any Naive Bayesian Classifier in python?

- by asldkncvas

Dear Everyone I have tried the Orange Framework for Naive Bayesian classification. The methods are extremely unintuitive, and the documentation is extremely unorganized. Does anyone here have another framework to recommend? I use mostly NaiveBayesian for now. I was thinking of using nltk's NaiveClassification but then they don't think they can handle continuous variables. What are my options?

Read the article
Are there any free openCV classifier libraries?

- by Khorkrak

Been tinkering with openCV in python. The face detection demo is impressive. Are there any free collection of Haar classifiers aside from the face, eyes and full body ones?

Read the article
PyML 0.7.2 - How to prevent accuracy from dropping after storing/loading a classifier?

- by Michael Aaron Safyan

This is a followup from "Save PyML.classifiers.multi.OneAgainstRest(SVM()) object?". The solution to that question was close, but not quite right, (the SparseDataSet is broken, so attempting to save/load with that dataset container type will fail, no matter what. Also, PyML is inconsistent in terms of whether labels should be numbers or strings... it turns out that the oneAgainstRest function is actually not good enough, because the labels need to be strings and simultaneously convertible to floats, because there are places where it is assumed to be a string and elsewhere converted to float) and so after a great deal of hacking and such I was finally able to figure out a way to save and load my multi-class classifier without it blowing up with an error.... however, although it is no longer giving me an error message, it is still not quite right as the accuracy of the classifier drops significantly when it is saved and then reloaded (so I'm still missing a piece of the puzzle). I am currently using the following custom mutli-class classifier for training, saving, and loading: class SVM(object): def __init__(self,features_or_filename,labels=None,kernel=None): if isinstance(features_or_filename,str): filename=features_or_filename; if labels!=None: raise ValueError,"Labels must be None if loading from a file."; with open(os.path.join(filename,"uniquelabels.list"),"rb") as uniquelabelsfile: self.uniquelabels=sorted(list(set(pickle.load(uniquelabelsfile)))); self.labeltoindex={}; for idx,label in enumerate(self.uniquelabels): self.labeltoindex[label]=idx; self.classifiers=[]; for classidx, classname in enumerate(self.uniquelabels): self.classifiers.append(PyML.classifiers.svm.loadSVM(os.path.join(filename,str(classname)+".pyml.svm"),datasetClass = PyML.VectorDataSet)); else: features=features_or_filename; if labels==None: raise ValueError,"Labels must not be None when training."; self.uniquelabels=sorted(list(set(labels))); self.labeltoindex={}; for idx,label in enumerate(self.uniquelabels): self.labeltoindex[label]=idx; points = [[float(xij) for xij in xi] for xi in features]; self.classifiers=[PyML.SVM(kernel) for label in self.uniquelabels]; for i in xrange(len(self.uniquelabels)): currentlabel=self.uniquelabels[i]; currentlabels=['+1' if k==currentlabel else '-1' for k in labels]; currentdataset=PyML.VectorDataSet(points,L=currentlabels,positiveClass='+1'); self.classifiers[i].train(currentdataset,saveSpace=False); def accuracy(self,pts,labels): logger=logging.getLogger("ml"); correct=0; total=0; classindexes=[self.labeltoindex[label] for label in labels]; h=self.hypotheses(pts); for idx in xrange(len(pts)): if h[idx]==classindexes[idx]: logger.info("RIGHT: Actual \"%s\" == Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ])); correct+=1; else: logger.info("WRONG: Actual \"%s\" != Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ])) total+=1; return float(correct)/float(total); def prediction(self,pt): h=self.hypothesis(pt); if h!=None: return self.uniquelabels[h]; return h; def predictions(self,pts): h=self.hypotheses(self,pts); return [self.uniquelabels[x] if x!=None else None for x in h]; def hypothesis(self,pt): bestvalue=None; bestclass=None; dataset=PyML.VectorDataSet([pt]); for classidx, classifier in enumerate(self.classifiers): val=classifier.decisionFunc(dataset,0); if (bestvalue==None) or (val>bestvalue): bestvalue=val; bestclass=classidx; return bestclass; def hypotheses(self,pts): bestvalues=[None for pt in pts]; bestclasses=[None for pt in pts]; dataset=PyML.VectorDataSet(pts); for classidx, classifier in enumerate(self.classifiers): for ptidx in xrange(len(pts)): val=classifier.decisionFunc(dataset,ptidx); if (bestvalues[ptidx]==None) or (val>bestvalues[ptidx]): bestvalues[ptidx]=val; bestclasses[ptidx]=classidx; return bestclasses; def save(self,filename): if not os.path.exists(filename): os.makedirs(filename); with open(os.path.join(filename,"uniquelabels.list"),"wb") as uniquelabelsfile: pickle.dump(self.uniquelabels,uniquelabelsfile,pickle.HIGHEST_PROTOCOL); for classidx, classname in enumerate(self.uniquelabels): self.classifiers[classidx].save(os.path.join(filename,str(classname)+".pyml.svm")); I am using the latest version of PyML (0.7.2, although PyML.__version__ is 0.7.0). When I construct the classifier with a training dataset, the reported accuracy is ~0.87. When I then save it and reload it, the accuracy is less than 0.001. So, there is something here that I am clearly not persisting correctly, although what that may be is completely non-obvious to me. Would you happen to know what that is?

Read the article
PyML 0.7.2 - How to prevent accuracy from dropping after stroing/loading a classifier?

- by Michael Aaron Safyan

This is a followup from "Save PyML.classifiers.multi.OneAgainstRest(SVM()) object?". The solution to that question was close, but not quite right, (the SparseDataSet is broken, so attempting to save/load with that dataset container type will fail, no matter what. Also, PyML is inconsistent in terms of whether labels should be numbers or strings... it turns out that the oneAgainstRest function is actually not good enough, because the labels need to be strings and simultaneously convertible to floats, because there are places where it is assumed to be a string and elsewhere converted to float) and so after a great deal of hacking and such I was finally able to figure out a way to save and load my multi-class classifier without it blowing up with an error.... however, although it is no longer giving me an error message, it is still not quite right as the accuracy of the classifier drops significantly when it is saved and then reloaded (so I'm still missing a piece of the puzzle). I am currently using the following custom mutli-class classifier for training, saving, and loading: class SVM(object): def __init__(self,features_or_filename,labels=None,kernel=None): if isinstance(features_or_filename,str): filename=features_or_filename; if labels!=None: raise ValueError,"Labels must be None if loading from a file."; with open(os.path.join(filename,"uniquelabels.list"),"rb") as uniquelabelsfile: self.uniquelabels=sorted(list(set(pickle.load(uniquelabelsfile)))); self.labeltoindex={}; for idx,label in enumerate(self.uniquelabels): self.labeltoindex[label]=idx; self.classifiers=[]; for classidx, classname in enumerate(self.uniquelabels): self.classifiers.append(PyML.classifiers.svm.loadSVM(os.path.join(filename,str(classname)+".pyml.svm"),datasetClass = PyML.VectorDataSet)); else: features=features_or_filename; if labels==None: raise ValueError,"Labels must not be None when training."; self.uniquelabels=sorted(list(set(labels))); self.labeltoindex={}; for idx,label in enumerate(self.uniquelabels): self.labeltoindex[label]=idx; points = [[float(xij) for xij in xi] for xi in features]; self.classifiers=[PyML.SVM(kernel) for label in self.uniquelabels]; for i in xrange(len(self.uniquelabels)): currentlabel=self.uniquelabels[i]; currentlabels=['+1' if k==currentlabel else '-1' for k in labels]; currentdataset=PyML.VectorDataSet(points,L=currentlabels,positiveClass='+1'); self.classifiers[i].train(currentdataset,saveSpace=False); def accuracy(self,pts,labels): logger=logging.getLogger("ml"); correct=0; total=0; classindexes=[self.labeltoindex[label] for label in labels]; h=self.hypotheses(pts); for idx in xrange(len(pts)): if h[idx]==classindexes[idx]: logger.info("RIGHT: Actual \"%s\" == Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ])); correct+=1; else: logger.info("WRONG: Actual \"%s\" != Predicted \"%s\"" %(self.uniquelabels[ classindexes[idx] ], self.uniquelabels[ h[idx] ])) total+=1; return float(correct)/float(total); def prediction(self,pt): h=self.hypothesis(pt); if h!=None: return self.uniquelabels[h]; return h; def predictions(self,pts): h=self.hypotheses(self,pts); return [self.uniquelabels[x] if x!=None else None for x in h]; def hypothesis(self,pt): bestvalue=None; bestclass=None; dataset=PyML.VectorDataSet([pt]); for classidx, classifier in enumerate(self.classifiers): val=classifier.decisionFunc(dataset,0); if (bestvalue==None) or (val>bestvalue): bestvalue=val; bestclass=classidx; return bestclass; def hypotheses(self,pts): bestvalues=[None for pt in pts]; bestclasses=[None for pt in pts]; dataset=PyML.VectorDataSet(pts); for classidx, classifier in enumerate(self.classifiers): for ptidx in xrange(len(pts)): val=classifier.decisionFunc(dataset,ptidx); if (bestvalues[ptidx]==None) or (val>bestvalues[ptidx]): bestvalues[ptidx]=val; bestclasses[ptidx]=classidx; return bestclasses; def save(self,filename): if not os.path.exists(filename): os.makedirs(filename); with open(os.path.join(filename,"uniquelabels.list"),"wb") as uniquelabelsfile: pickle.dump(self.uniquelabels,uniquelabelsfile,pickle.HIGHEST_PROTOCOL); for classidx, classname in enumerate(self.uniquelabels): self.classifiers[classidx].save(os.path.join(filename,str(classname)+".pyml.svm")); I am using the latest version of PyML (0.7.2, although PyML.__version__ is 0.7.0). When I construct the classifier with a training dataset, the reported accuracy is ~0.87. When I then save it and reload it, the accuracy is less than 0.001. So, there is something here that I am clearly not persisting correctly, although what that may be is completely non-obvious to me. Would you happen to know what that is?

Read the article
Sentiment analysis with NLTK python for sentences using sample data or webservice?

- by Ke

I am embarking upon a NLP project for sentiment analysis. I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice) I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??) Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron" Then I would like to check for positive/negative sentiment in each sentence and count them accordingly NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm. Here are the troubles I am having: All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?) I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary? I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link? Any help is much appreciated and will save me much hair! Cheers Ke

Read the article
Bag of words Classification

- by AlgoMan

I need find words training words and their classification. Simple classification such as . Sports Entertainment and Politics things like that. Where Can i find the words and their classifications. I know many universities have done Bag of words classifications. Is there any repository of training examples ?

Read the article
How to persist non-trivial fields in Play Framework

- by AlexR

I am trying to persist complex objects using Ebeans in Play Framework (2.03). In particular, I've created a class that contains a field of type weka.classifier.Classifier (Weka is a popular machine learning library - see http://weka.sourceforge.net/doc/weka/classifiers/Classifier.html). Classifier implements Serializeable so I hoped that I can get away with something like @Entity @Table(name = "classifiers") public class ClassifierData extends Model { @Id public Long id; public Classifier classifier; } However, the Evolutions script suggests the following database structure: create table classifiers ( id bigint auto_increment not null, constraint pk_classifiers primary key (id)) ) In other words, it ignores the field of type Classifier. (The database is MySQL if it makes any difference) What should I do to store complex serializeable objects using Ebean/Evolutions/PlayFramework?

Read the article
How do I classify using GLCM and SVM Classifier in Matlab?

- by Gomathi

I'm on a project of liver tumor segmentation and classification. I used Region Growing and FCM for liver and tumor segmentation respectively. Then, I used Gray Level Co-occurence matrix for texture feature extraction. I have to use Support Vector Machine for Classification. But I don't know how to normalize the feature vectors. Can anyone tell how to program it in Matlab? To the GLCM program, I gave the tumor segmented image as input. Was I correct? If so, I think, then, my output will also be correct. My glcm coding, as far as I have tried is, I = imread('fzliver3.jpg'); GLCM = graycomatrix(I,'Offset',[2 0;0 2]); stats = graycoprops(GLCM,'all') t1= struct2array(stats) I2 = imread('fzliver4.jpg'); GLCM2 = graycomatrix(I2,'Offset',[2 0;0 2]); stats2 = graycoprops(GLCM2,'all') t2= struct2array(stats2) I3 = imread('fzliver5.jpg'); GLCM3 = graycomatrix(I3,'Offset',[2 0;0 2]); stats3 = graycoprops(GLCM3,'all') t3= struct2array(stats3) t=[t1;t2;t3] xmin = min(t); xmax = max(t); scale = xmax-xmin; tf=(x-xmin)/scale Was this a correct implementation? Also, I get an error at the last line. My output is: stats = Contrast: [0.0510 0.0503] Correlation: [0.9513 0.9519] Energy: [0.8988 0.8988] Homogeneity: [0.9930 0.9935] t1 = Columns 1 through 6 0.0510 0.0503 0.9513 0.9519 0.8988 0.8988 Columns 7 through 8 0.9930 0.9935 stats2 = Contrast: [0.0345 0.0339] Correlation: [0.8223 0.8255] Energy: [0.9616 0.9617] Homogeneity: [0.9957 0.9957] t2 = Columns 1 through 6 0.0345 0.0339 0.8223 0.8255 0.9616 0.9617 Columns 7 through 8 0.9957 0.9957 stats3 = Contrast: [0.0230 0.0246] Correlation: [0.7450 0.7270] Energy: [0.9815 0.9813] Homogeneity: [0.9971 0.9970] t3 = Columns 1 through 6 0.0230 0.0246 0.7450 0.7270 0.9815 0.9813 Columns 7 through 8 0.9971 0.9970 t = Columns 1 through 6 0.0510 0.0503 0.9513 0.9519 0.8988 0.8988 0.0345 0.0339 0.8223 0.8255 0.9616 0.9617 0.0230 0.0246 0.7450 0.7270 0.9815 0.9813 Columns 7 through 8 0.9930 0.9935 0.9957 0.9957 0.9971 0.9970 ??? Error using ==> minus Matrix dimensions must agree. The images are:

Read the article
How do I classify using SVM Classifier in Matlab?

- by Gomathi

I'm on a project of liver tumor segmentation and classification. I used Region Growing and FCM for liver and tumor segmentation respectively. Then, I used Gray Level Co-occurence matrix for texture feature extraction. I have to use Support Vector Machine for Classification. But I don't know how to normalize the feature vectors. Can anyone tell how to program it in Matlab? To the GLCM program, I gave the tumor segmented image as input. Was I correct? If so, I think, then, my output will also be correct. I gave the parameters exactly as in the example provided in the documentation itself. The output I obtained was stats = autoc: [1.857855266614132e+000 1.857955341199538e+000] contr: [5.103143332457753e-002 5.030548650257343e-002] corrm: [9.512661919561399e-001 9.519459060378332e-001] corrp: [9.512661919561385e-001 9.519459060378338e-001] cprom: [7.885631654779597e+001 7.905268525471267e+001] cshad: [1.219440700252286e+001 1.220659371449108e+001] dissi: [2.037387269065756e-002 1.935418927908687e-002] energ: [8.987753042491253e-001 8.988459843719526e-001] entro: [2.759187341212805e-001 2.743152140681436e-001] homom: [9.930016927881388e-001 9.935307908219834e-001] homop: [9.925660617240367e-001 9.930960070222014e-001] maxpr: [9.474275457490587e-001 9.474466930429607e-001] sosvh: [1.847174384255155e+000 1.846913030238459e+000] savgh: [2.332207337361002e+000 2.332108469591401e+000] svarh: [6.311174784234007e+000 6.314794324825067e+000] senth: [2.663144677055123e-001 2.653725436772341e-001] dvarh: [5.103143332457753e-002 5.030548650257344e-002] denth: [7.573115918713391e-002 7.073380266499811e-002] inf1h: [-8.199645492654247e-001 -8.265514568489666e-001] inf2h: [5.643539051044213e-001 5.661543271625117e-001] indnc: [9.980238521073823e-001 9.981394883569174e-001] idmnc: [9.993275086521848e-001 9.993404634013308e-001] The thing is, I run the program for three images. But all three gave me the same output. When I used graycoprops() stat = Contrast: 4.721877658740964e+005 Correlation: -3.282870417955449e-003 Energy: 8.647689474127760e-006 Homogeneity: 8.194621855726478e-003 stat = Contrast: 2.817160447307697e+004 Correlation: 2.113032196952781e-005 Energy: 4.124904827799189e-004 Homogeneity: 2.513567163994905e-002 stat = Contrast: 7.086638436309059e+004 Correlation: 2.459637878221028e-002 Energy: 4.640677159445994e-004 Homogeneity: 1.158305728309460e-002 The images are:

Read the article
What would be a good language to implement a naive bayes classifier from scratch?

- by kita

I would like to implement a naive bayes classifier for spam filtering from scratch as a learning exercise. What would be the best langauge of the following to try this out in? Java Ruby C++ C something else Please give reasons (it would help greatly!)

Read the article
What algorithms are suitable for this simple machine learning problem?

- by user213060

I have a what I think is a simple machine learning question. Here is the basic problem: I am repeatedly given a new object and a list of descriptions about the object. For example: new_object: 'bob' new_object_descriptions: ['tall','old','funny']. I then have to use some kind of machine learning to find previously handled objects that had similar descriptions, for example, past_similar_objects: ['frank','steve','joe']. Next, I have an algorithm that can directly measure whether these objects are indeed similar to bob, for example, correct_objects: ['steve','joe']. The classifier is then given this feedback training of successful matches. Then this loop repeats with a new object. a Here's the pseudo-code: Classifier=new_classifier() while True: new_object,new_object_descriptions = get_new_object_and_descriptions() past_similar_objects = Classifier.classify(new_object,new_object_descriptions) correct_objects = calc_successful_matches(new_object,past_similar_objects) Classifier.train_successful_matches(object,correct_objects) But, there are some stipulations that may limit what classifier can be used: There will be millions of objects put into this classifier so classification and training needs to scale well to millions of object types and still be fast. I believe this disqualifies something like a spam classifier that is optimal for just two types: spam or not spam. (Update: I could probably narrow this to thousands of objects instead of millions, if that is a problem.) Again, I prefer speed when millions of objects are being classified, over accuracy. What are decent, fast machine learning algorithms for this purpose?

Read the article
Save PyML.classifiers.multi.OneAgainstRest(SVM()) object?

- by Michael Aaron Safyan

I'm using PYML to construct a multiclass linear support vector machine (SVM). After training the SVM, I would like to be able to save the classifier, so that on subsequent runs I can use the classifier right away without retraining. Unfortunately, the .save() function is not implemented for that classifier, and attempting to pickle it (both with standard pickle and cPickle) yield the following error message: pickle.PicklingError: Can't pickle : it's not found as __builtin__.PySwigObject Does anyone know of a way around this or of an alternative library without this problem? Thanks. Edit/Update I am now training and attempting to save the classifier with the following code: mc = multi.OneAgainstRest(SVM()); mc.train(dataset_pyml,saveSpace=False); for i, classifier in enumerate(mc.classifiers): filename=os.path.join(prefix,labels[i]+".svm"); classifier.save(filename); Notice that I am now saving with the PyML save mechanism rather than with pickling, and that I have passed "saveSpace=False" to the training function. However, I am still gettting an error: ValueError: in order to save a dataset you need to train as: s.train(data, saveSpace = False) However, I am passing saveSpace=False... so, how do I save the classifier(s)? P.S. The project I am using this in is pyimgattr, in case you would like a complete testable example... the program is run with "./pyimgattr.py train"... that will get you this error. Also, a note on version information: [michaelsafyan@codemage /Volumes/Storage/classes/cse559/pyimgattr]$ python Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or "license" for more information. import PyML print PyML.__version__ 0.7.0

Read the article
Equivalent of #map in ruby in golang

- by Oct

I'm playing with Go and run into something I'm unable to find in Google, although there is certainly something that exists: I'm using the following struct: type Syntax struct { name string extensions *regexp.Regexp } type Scanner struct { classifier * bayesian.Classifier save_file string name_to_syntax map[string] *Syntax extensions_to_syntax map[*regexp.Regexp] *Syntax } I'd like to perform the following using Go and I'm quoting ruby because it's how I'd do that using ruby: test_regexpes = my_scanner.extensions_to_syntax.keys My goal is to get an array of *regexp.Regexp . Any idea on how to do that in a simple way ? Thank you !

Read the article
T-SQL Tuesday #53-Matt's Making Me Do This!

- by Most Valuable Yak (Rob Volk)

Hello everyone! It's that time again, time for T-SQL Tuesday, the wonderful blog series started by Adam Machanic (b|t). This month we are hosted by Matt Velic (b|t) who asks the question, "Why So Serious?", in celebration of April Fool's Day. He asks the contributors for their dirty tricks. And for some reason that escapes me, he and Jeff Verheul (b|t) seem to think I might be able to write about those. Shocked, I am! Nah, not really. They're absolutely right, this one is gonna be fun! I took some inspiration from Matt's suggestions, namely Resource Governor and Login Triggers. I've done some interesting login trigger stuff for a presentation, but nothing yet with Resource Governor. Best way to learn it! One of my oldest pet peeves is abuse of the sa login. Don't get me wrong, I use it too, but typically only as SQL Agent job owner. It's been a while since I've been stuck with it, but back when I started using SQL Server, EVERY application needed sa to function. It was hard-coded and couldn't be changed. (welllllll, that is if you didn't use a hex editor on the EXE file, but who would do such a thing?) My standard warning applies: don't run anything on this page in production. In fact, back up whatever server you're testing this on, including the master database. Snapshotting a VM is a good idea. Also make sure you have other sysadmin level logins on that server. So here's a standard template for a logon trigger to address those pesky sa users: CREATE TRIGGER SA_LOGIN_PRIORITY ON ALL SERVER WITH ENCRYPTION, EXECUTE AS N'sa' AFTER LOGON AS IF ORIGINAL_LOGIN()<>N'sa' OR APP_NAME() LIKE N'SQL Agent%' RETURN; -- interesting stuff goes here GO What can you do for "interesting stuff"? Books Online limits itself to merely rolling back the logon, which will throw an error (and alert the person that the logon trigger fired). That's a good use for logon triggers, but really not tricky enough for this blog. Some of my suggestions are below: WAITFOR DELAY '23:59:59'; Or: EXEC sp_MSforeach_db 'EXEC sp_detach_db ''?'';' Or: EXEC msdb.dbo.sp_add_job @job_name=N'`', @enabled=1, @start_step_id=1, @notify_level_eventlog=0, @delete_level=3; EXEC msdb.dbo.sp_add_jobserver @job_name=N'`', @server_name=@@SERVERNAME; EXEC msdb.dbo.sp_add_jobstep @job_name=N'`', @step_id=1, @step_name=N'`', @command=N'SHUTDOWN;'; EXEC msdb.dbo.sp_start_job @job_name=N'`'; Really, I don't want to spoil your own exploration, try it yourself! The thing I really like about these is it lets me promote the idea that "sa is SLOW, sa is BUGGY, don't use sa!". Before we get into Resource Governor, make sure to drop or disable that logon trigger. They don't work well in combination. (Had to redo all the following code when SSMS locked up) Resource Governor is a feature that lets you control how many resources a single session can consume. The main goal is to limit the damage from a runaway query. But we're not here to read about its main goal or normal usage! I'm trying to make people stop using sa BECAUSE IT'S SLOW! Here's how RG can do that: USE master; GO CREATE FUNCTION dbo.SA_LOGIN_PRIORITY() RETURNS sysname WITH SCHEMABINDING, ENCRYPTION AS BEGIN RETURN CASE WHEN ORIGINAL_LOGIN()=N'sa' AND APP_NAME() NOT LIKE N'SQL Agent%' THEN N'SA_LOGIN_PRIORITY' ELSE N'default' END END GO CREATE RESOURCE POOL SA_LOGIN_PRIORITY WITH ( MIN_CPU_PERCENT = 0 ,MAX_CPU_PERCENT = 1 ,CAP_CPU_PERCENT = 1 ,AFFINITY SCHEDULER = (0) ,MIN_MEMORY_PERCENT = 0 ,MAX_MEMORY_PERCENT = 1 -- ,MIN_IOPS_PER_VOLUME = 1 ,MAX_IOPS_PER_VOLUME = 1 -- uncomment for SQL Server 2014 ); CREATE WORKLOAD GROUP SA_LOGIN_PRIORITY WITH ( IMPORTANCE = LOW ,REQUEST_MAX_MEMORY_GRANT_PERCENT = 1 ,REQUEST_MAX_CPU_TIME_SEC = 1 ,REQUEST_MEMORY_GRANT_TIMEOUT_SEC = 1 ,MAX_DOP = 1 ,GROUP_MAX_REQUESTS = 1 ) USING SA_LOGIN_PRIORITY; ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION=dbo.SA_LOGIN_PRIORITY); ALTER RESOURCE GOVERNOR RECONFIGURE; From top to bottom: Create a classifier function to determine which pool the session should go to. More info on classifier functions. Create the pool and provide a generous helping of resources for the sa login. Create the workload group and further prioritize those resources for the sa login. Apply the classifier function and reconfigure RG to use it. I have to say this one is a bit sneakier than the logon trigger, least of all you don't get any error messages. I heartily recommend testing it in Management Studio, and click around the UI a lot, there's some fun behavior there. And DEFINITELY try it on SQL 2014 with the IO settings included! You'll notice I made allowances for SQL Agent jobs owned by sa, they'll go into the default workload group. You can add your own overrides to the classifier function if needed. Some interesting ideas I didn't have time for but expect you to get to before me: Set up different pools/workgroups with different settings and randomize which one the classifier chooses Do the same but base it on time of day (Books Online example covers this)... Or, which workstation it connects from. This can be modified for certain special people in your office who either don't listen, or are attracted (and attractive) to you. And if things go wrong you can always use the following from another sysadmin or Dedicated Admin connection: ALTER RESOURCE GOVERNOR DISABLE; That will let you go in and either fix (or drop) the pools, workgroups and classifier function. So now that you know these types of things are possible, and if you are tired of your team using sa when they shouldn't, I expect you'll enjoy playing with these quite a bit! Unfortunately, the aforementioned Dedicated Admin Connection kinda poops on the party here. Books Online for both topics will tell you that the DAC will not fire either feature. So if you have a crafty user who does their research, they can still sneak in with sa and do their bidding without being hampered. Of course, you can still detect their login via various methods, like a server trace, SQL Server Audit, extended events, and enabling "Audit Successful Logins" on the server. These all have their downsides: traces take resources, extended events and SQL Audit can't fire off actions, and enabling successful logins will bloat your error log very quickly. SQL Audit is also limited unless you have Enterprise Edition, and Resource Governor is Enterprise-only. And WORST OF ALL, these features are all available and visible through the SSMS UI, so even a doofus developer or manager could find them. Fortunately there are Event Notifications! Event notifications are becoming one of my favorite features of SQL Server (keep an eye out for more blogs from me about them). They are practically unknown and heinously underutilized. They are also a great gateway drug to using Service Broker, another great but underutilized feature. Hopefully this will get you to start using them, or at least your enemies in the office will once they read this, and then you'll have to learn them in order to fix things. So here's the setup: USE msdb; GO CREATE PROCEDURE dbo.SA_LOGIN_PRIORITY_act WITH ENCRYPTION AS DECLARE @x XML, @message nvarchar(max); RECEIVE @x=CAST(message_body AS XML) FROM SA_LOGIN_PRIORITY_q; IF @x.value('(//LoginName)[1]','sysname')=N'sa' AND @x.value('(//ApplicationName)[1]','sysname') NOT LIKE N'SQL Agent%' BEGIN -- interesting activation procedure stuff goes here END GO CREATE QUEUE SA_LOGIN_PRIORITY_q WITH STATUS=ON, RETENTION=OFF, ACTIVATION (PROCEDURE_NAME=dbo.SA_LOGIN_PRIORITY_act, MAX_QUEUE_READERS=1, EXECUTE AS OWNER); CREATE SERVICE SA_LOGIN_PRIORITY_s ON QUEUE SA_LOGIN_PRIORITY_q([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]); CREATE EVENT NOTIFICATION SA_LOGIN_PRIORITY_en ON SERVER WITH FAN_IN FOR AUDIT_LOGIN TO SERVICE N'SA_LOGIN_PRIORITY_s', N'current database' GO From top to bottom: Create activation procedure for event notification queue. Create queue to accept messages from event notification, and activate the procedure to process those messages when received. Create service to send messages to that queue. Create event notification on AUDIT_LOGIN events that fire the service. I placed this in msdb as it is an available system database and already has Service Broker enabled by default. You should change this to another database if you can guarantee it won't get dropped. So what to put in place for "interesting activation procedure code"? Hmmm, so far I haven't addressed Matt's suggestion of writing a lengthy script to send an annoying message: SET @[email protected]('(//HostName)[1]','sysname') + N' tried to log in to server ' + @x.value('(//ServerName)[1]','sysname') + N' as SA at ' + @x.value('(//StartTime)[1]','sysname') + N' using the ' + @x.value('(//ApplicationName)[1]','sysname') + N' program. That''s why you''re getting this message and the attached pornography which' + N' is bloating your inbox and violating company policy, among other things. If you know' + N' this person you can go to their desk and hit them, or use the following SQL to end their session: KILL ' + @x.value('(//SPID)[1]','sysname') + N'; Hopefully they''re in the middle of a huge query that they need to finish right away.' EXEC msdb.dbo.sp_send_dbmail @recipients=N'[email protected]', @subject=N'SA Login Alert', @query_result_width=32767, @body=@message, @query=N'EXEC sp_readerrorlog;', @attach_query_result_as_file=1, @query_attachment_filename=N'UtterlyGrossPorn_SeriouslyDontOpenIt.jpg' I'm not sure I'd call that a lengthy script, but the attachment should get pretty big, and I'm sure the email admins will love storing multiple copies of it. The nice thing is that this also fires on Dedicated Admin connections! You can even identify DAC connections from the event data returned, I leave that as an exercise for you. You can use that info to change the action taken by the activation procedure, and since it's a stored procedure, it can pretty much do anything! Except KILL the SPID, or SHUTDOWN the server directly. I'm still working on those.

Read the article
Help matching fields between two classes

- by Michael

I'm not too experienced with Java yet, and I'm hoping someone can steer me in the right direction because right now I feel like I'm just beating my head against a wall... The first class is called MeasuredParams, and it's got 40+ numeric fields (height, weight, waistSize, wristSize - some int, but mostly double). The second class is a statistical classifier called Classifier. It's been trained on a subset of the MeasuredParams fields. The names of the fields that the Classifier has been trained on is stored, in order, in an array called reqdFields. What I need to do is load a new array, toClassify, with the values stored in the fields from MeasuredParams that match the field list (including order) found in reqdFields. I can make any changes necessary to the MeasuredParams class, but I'm stuck with Classifier as it is. My brute-force approach was to get rid of the fields in MeasuredParams and use an arrayList instead, and store the field names in an Enum object to act as an index pointer. Then loop through the reqdFields list, one element at a time, and find the matching name in the Enum object to find the correct position in the arrayList. Load the value stored at that positon into toClassify, and then continue on to the next element in reqdFields. I'm not sure how exactly I would search through the Enum object - it would be a lot easier if the field names were stored in a second arrayList. But then the index positions between the two would have to stay matched, and I'm back to using an Enum. I think. I've been running around in circles all afternoon, and I keep thinking there must be an easier way of doing it. I'm just stuck right now and can't see past what I've started. Any help would be GREATLY appreciated. Thanks so much! Michael

Read the article
Maven test dependency in multi module project

- by user209947

I use maven to build a multi module project. My module 2 depends on Module 1 src at compile scope and module 1 tests in test scope. Module 2 - <dependency> <groupId>blah</groupId> <artifactId>MODULE1</artifactId> <version>blah</version> <classifier>tests</classifier> <scope>test</scope> </dependency> This works fine. Say my module 3 depends on Module1 src and tests at compile time. Module 3 - <dependency> <groupId>blah</groupId> <artifactId>MODULE1</artifactId> <version>blah</version> <classifier>tests</classifier> <scope>complie</scope> </dependency> When I run mvn clean install, my build runs till module 3, fails at module 3 as it couldnt resolve the module 1 test dependency. Then I do a mvn install on module 3 alone, go back and run mvn install on my parent pom to make it build. How can i fix this?

Read the article

1 2 3 4 | Next Page >