Search Results

Search found 61 results on 3 pages for 'bayesian'.

Page 2/3 | < Previous Page | 1 2 3 | Next Page >

Measuring the performance of classification algorithm

- by Silver Dragon

I've got a classification problem in my hand, which I'd like to address with a machine learning algorithm ( Bayes, or Markovian probably, the question is independent on the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classificator, with taking data overfitting problem into account. That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples, and use this very same samples to measure fitness, it might stuck into a data overfitting problem -the classifier will know the exact answers for the training instances, without having much predictive power, rendering the fitness results useless. An obvious solution would be seperating the hand-tagged samples into training, and test samples; and I'd like to learn about methods selecting the statistically significant samples for training. White papers, book pointers, and PDFs much appreciated!

Read the article
How to filter/sort/rank object model nodes?

- by BCS

I have some kind of object model and I need to filter and sort it's nodes for some kind of property. What kinds of automated systems exist to generate and select properties of the object model that correlate to what I want? (I'm intentionally being abstract and non-specific) I'm thinking of a system that works kind of like spam filters or supervised classification systems in that given an example data set it identifies rules that find nodes of interest. However I'm looking for a more general system in that it shouldn't require any design time information about the object model. It should work equality well as a spam filter on e-mail, a bug finder on a code base, an interest filter in a newsgroup or bot accounts finder on a social networking site. As long as it can explore the object model via reflection and be given a set of "interesting" nodes, it should be able to find rules that will find more nodes like them.

Read the article
Equivalent of #map in ruby in golang

- by Oct

I'm playing with Go and run into something I'm unable to find in Google, although there is certainly something that exists: I'm using the following struct: type Syntax struct { name string extensions *regexp.Regexp } type Scanner struct { classifier * bayesian.Classifier save_file string name_to_syntax map[string] *Syntax extensions_to_syntax map[*regexp.Regexp] *Syntax } I'd like to perform the following using Go and I'm quoting ruby because it's how I'd do that using ruby: test_regexpes = my_scanner.extensions_to_syntax.keys My goal is to get an array of *regexp.Regexp . Any idea on how to do that in a simple way ? Thank you !

Read the article
When to choose which machine learning classifier?

- by LM

Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.) How do I know which classifier I should use? (Decision tree, SVM, Bayesian, logistic regression, etc.) In which cases is one of them the "natural" first choice, and what are the principles for choosing that one? Examples of the type of answers I'm looking for (from Manning et al.'s "Introduction to Information Retrieval book": http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html): a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes). [I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.] b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability. What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though. Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)? [Not sure if stackoverflow is the best place to ask this question, since it's more machine learning than actual programming -- if not, any suggestions for where else?]

Read the article
Speeding up inner-joins and subqueries while restricting row size and table membership

- by hiffy

I'm developing an rss feed reader that uses a bayesian filter to filter out boring blog posts. The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and bayesian filter classifications. After a user marks an entry as read, it will be added to the metadata table (so that a user isn't presented with material they have already read), and deleted from the stream table. Every three minutes, a background process will repopulate the Stream table with new entries (i.e. whenever the daemon adds new entries after the checks the rss feeds for updates). Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; it'll reduce duplication, make processing faster and give me some flexibility with how I display the entries. The query (takes about 9 seconds on 3600 items with no indexes): insert into stream(entry_id, user_id) select entries.id, subscriptions_users.user_id from entries inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id where subscriptions_users.user_id = 1 and entries.id not in (select entry_id from metadata where metadata.user_id = 1) and entries.id not in (select entry_id from stream where user_id = 1); The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. do not exist in metadata) and which do not already exist in the stream. Attempted solution: adding limit 100 to the end speeds up the query considerably, but upon repeated executions will keep on adding a different set of 100 entries that do not already exist in the table (with each successful query taking longer and longer). This is close but not quite what I wanted to do. Does anyone have any advice (nosql?) or know a more efficient way of composing the query?

Read the article
Looking for spam mails.

- by apalsson

I am looking for a bundle of spam mails to use with my experiments with Bayesian spam filtering. Is there any resource for this or does anyone have a bundle to share with me? Something around 4-6gb zip:ed would be a nice start. Feel free to filter out your own email-address, though this is for personal use and it won't be shared with anyone. Thanks.

Read the article
Where would you start if you were trying to solve this PDF classification problem?

- by burtonic

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the: Title Contents (full text) Page count Word count Orientation First line Using this data we are checking for the obvious phrases such as: Annual report Financial statement Quarterly report Interim report Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?

Read the article
Collaborative Filtering Techniques

- by user95261

Good Day! I am in need of help about collaborative filtering techniques implementation in predicting psychopathy of twitter users. I have two data set, training set and test set. Training set users have already scores in psychopathy, I need any collaborative filtering techniques to predict scores of test set users. Collaborative Filtering such as Item/User-Based CF, Bayesian Belief Nets, Clustering, Latent Semantic, etc. Please help me. :( I am very confused on how to implement any of these. Thank you!

Read the article
How to identify a PDF classification problem?

- by burtonic

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the: Title Contents (full text) Page count Word count Orientation First line Using this data we are checking for the obvious phrases such as: Annual report Financial statement Quarterly report Interim report Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?

Read the article
details on the following Natural Language Processing terms ?

- by wefwgeweg

Named Entity Extraction (extract ppl, cities, organizations) Content Tagging (extract topic tags by scanning doc) Structured Data Extraction Topic Categorization (taxonomy classification by scanning doc....bayesian ) Text extraction (HTML page cleaning) are there libraries that i can use to do any of the above functions of NLP ? dont really feel like forking out cash to AlchemyAPI

Read the article
Anyone know of a good open source spam checker in java or c#?

- by Spines

I'm creating a site where users can write articles and comment on the articles. I want to automatically check to see if a new article or comment is spam. What are good libraries for doing this? I looked at bayesian classifier libraries, but it seems that I would have to gather a large amount of samples and classify them all as spam or not spam myself... I'm looking for something that can hopefully just tell me right out of the box.

Read the article
Training Exchange junk filter via OS X Mail.app

- by Doug Harris

My company uses Exchange 2007 for our mail server (via 3rd party hosted service), I use OS X Mail.app on my MacBook Pro. When viewing my junk folder in the webmail interface, there's a "Not Junk" button which will move the email out of the Junk folder. I can mimic the basic functionality in Mail.app by dragging the message from the junk folder to another folder. But... does this train the server? Specific questions: Does pressing the "Not Junk" button in the webmail interface (or Outlook) do anything more on the server side? Does Exchange used any sort of statistical spam filtering (e.g. bayesian filtering) based on this action? If the answer to #1 is yes, does moving the folder out of the junk folder in Mail.app trigger the statistical learning? If the answer to #2 is no, is there a different way to trigger the learning?

Read the article
User-trainable spam filtering with Exim and Dovecot

- by pascal

Currently I'm using Exim to deliver mails via dovecot-lda into Dovecot mailboxes. I'd like to add spam filtering, but I don't want to reject false-positives in Exim, and I want to train the bayesian filter from the client. So: How do I configure a spamd such that spam lands in a Junk folder, and when the user finds spam in their Inbox, or ham in their Junk, they move it to the correct box which trains the spamd. I have found dovecot-antispam but I'm not sure about its quality, it only seems to support dspam (and crm114, which is dead, last release August 2009).

Read the article
How to integrate Thunderbird with SpamAssassin running on the server?

- by haimg

I'm trying to integrate SpamAssassin running on a server with Thunderbird. Basically I need to be able to select several emails in Thunderbird and send them back to SpamAssassin for training, either as spam or ham. I tried several approaches: Tried "Report Spam" plugin, which is able to send message back to server either as an email attachment or via HTTP post. However, the plugin is rather buggy... Does not support sending several messages at once, "report as ham" is not working at all, etc. Wanted to make a custom button that will copy selected messages to a separate IMAP folder (I could create "LearnAsSpam" and "LearnAsHam" folders in IMAP that will get processed automatically on server), but don't even know how to approach this in Thunderbird, don't want to learn Thunderbird extention authoring... Server-side, I'm prepared to do some custom programming or integration needed (can receive a message via HTTP / SMTP / whatever), my stumbling block is Thunderbird... So, how can I send emails from Thunderbird back to SpamAssassin running on email server for Bayesian training, with as few keystrokes as possible?

Read the article
CodePlex Daily Summary for Monday, November 07, 2011

CodePlex Daily Summary for Monday, November 07, 2011Popular ReleasesGoogleMap Control: GoogleMap Control 6.0: Major design changes to the control in order to achieve better scalability and extensibility for the new features comming with GoogleMaps API. GoogleMap control switched to GoogleMaps API v3 and .NET 4.0. GoogleMap control is 100% ScriptControl now, it requires ScriptManager to be registered on the pages where and before it is used. Markers, polylines, polygons and directions were implemented as ExtenderControl, instead of being inner properties of GoogleMap control. Better perfomance. Better...WDTVHubGen - Adds Metadata, thumbnails and subtitles to WDTV Live Hubs: V2.1: Version 2.1 (click on the right) this uses V4.0 of .net Version 2.1 adds the following features: (apologize if I forget some, added a lot of little things) Manual Lookup with TV or Movie (finally huh!), you can look up a movie or TV episode directly, you can right click on anythign, and choose manual lookup, then will allow you to type anything you want to look up and it will assign it to the file you right clicked. No Rename: a very popular request, this is an option you can set so that t...Bulk Copy Test Cases Tool for Microsoft Test Manager & TFS: Bulk Copy Test Cases Tool: A while ago I had written a blog post Microsoft Test Manager Test Case Versioning on how to manage Test Cases over multiple releases which required you to manually copy test cases individually. Now there is a tool to help with the bulk copying of Test Cases that updates the Iteration field at the same time.Self-Tracking Entity Generator for WPF and Silverlight: Self-Tracking Entity Generator v 0.9.9 Update 2: Self-Tracking Entity Generator v 0.9.9 for Entity Framework 4.0. No change to the self-tracking entity generator v 0.9.9. WPF sample (SchoolSample) is updated with unit testing for both ViewModel and Model classes.SubExtractor: Release 1020: Feature: added "baseline double quotes" character to selector box Feature: added option to save SRT files as ANSI (instead of previous UTF-8 only) Feature: made "Save Sup files to Source directory" apply to both Sup and Idx source files. Fix: removed SDH text (...) or [...] that is split over 2 lines Fix: better decision-making in when to prefix a line with a '-' because SDH was removedAcDown????? - Anime&Comic Downloader: AcDown????? v3.6.1: ?? ● AcDown??????????、??????，??????????????????????，???????Acfun、Bilibili、???、???、???、Tucao.cc、SF???、?????80????，???????????、?????????。 ● AcDown???????????????????????????，???，???????????????????。 ● AcDown???????C#??，????.NET Framework 2.0??。?????"Acfun?????"。 ????32??64? Windows XP/Vista/7 ????????????? ??：????????Windows XP???,?????????.NET Framework 2.0???(x86)?.NET Framework 2.0???(x64)，?????"?????????"??? ??????????????，??????????： ??"AcDown?????"????????? ?? v3.6.1?? ??.hlv...Track Folder Changes: Track Folder Changes 1.1: Fixed exception when right-clicking the root nodeKinect Toolbox: Kinect Toolbox v1.1.0.2: This version adds support for the Kinect for Windows SDK beta 2.MapWindow 4: MapWindow GIS v4.8.6 - Final release - 32Bit: This is the final release of MapWindow v4.8. It has 4.8.6 as version number. This version has been thoroughly tested. If you do get an exception send the exception to us. Don't forget to include your e-mail address. Use the forums at http://www.mapwindow.org/phorum/ for questions. Please consider donating a small portion of the money you have saved by having free GIS tools: http://www.mapwindow.org/pages/donate.php What’s New in 4.8.6 (Final release) · A few minor issues have been fixed Wha...Kinect Mouse Cursor: Kinect Mouse Cursor 1.1: Updated for Kinect for Windows SDK v1.0 Beta 2!Coding4Fun Kinect Toolkit: Coding4Fun Kinect Toolkit 1.1: Updated for Kinect for Windows SDK v1.0 Beta 2!Async Executor: 1.0: Source code of the AsyncExecutorMedia Companion: MC 3.421b Weekly: Ensure .NET 4.0 Full Framework is installed. (Available from http://www.microsoft.com/download/en/details.aspx?id=17718) Ensure the NFO ID fix is applied when transitioning from versions prior to 3.416b. (Details here) TV Show Resolutions... Fix to show the season-specials.tbn when selecting an episode from season 00. Before, MC would try & load season00.tbn Fix for issue #197 - new show added by 'Manually Add Path' not being picked up. Also made non-visible the same thing in Root Folders...Nearforums - ASP.NET MVC forum engine: Nearforums v7.0: Version 7.0 of Nearforums, the ASP.NET MVC Forum Engine, containing new features: UI: Flexible layout to handle both list and table-like template layouts. Theming - Visual choice of themes: Deliver some templates on installation, export/import functionality, preview. Allow site owners to choose default list sort order for the forums. Forum latest activity. Visit the project Roadmap for more details. Webdeploy packages sha1 checksum: e6bb913e591543ab292a753d1a16cdb779488c10?????????? - ????????: All-In-One Code Framework ??? 2011-11-02: http://download.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=1codechs&DownloadId=216140 ??????，11??，?????20????Microsoft OneCode Sample，????6?Program Language Sample，2?Windows Base Sample，2?GDI+ Sample，4?Internet Explorer Sample?6?ASP.NET Sample。?????????????。 ????，?????。http://i3.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=1code&DownloadId=128165 Program Language CSImageFullScreenSlideShow VBImageFullScreenSlideShow CSDynamicallyBuildLambdaExpressionWithFie...Python Tools for Visual Studio: 1.1 Alpha: We’re pleased to announce the release of Python Tools for Visual Studio 1.1 Alpha. Python Tools for Visual Studio (PTVS) is an open-source plug-in for Visual Studio which supports programming with the Python programming language. This release includes new core IDE features, a couple of new sample libraries for interacting with Kinect and Excel, and many bug fixes for issues reported since the release of 1.0. For the core IDE features we’ve added many new features which improve the basic edit...BExplorer (Better Explorer): Better Explorer 2.0.0.631 Alpha: Changelog: Added: Some new functions in ribbon Added: Possibility to choose displayed columns Added: Basic Search Fixed: Some bugs after navigation Fixed: Attempt to fix slow navigation and slow start Known issues: - BreadcrumbBar fails on some situations - Basic search not work quite well in some situations Please if anyone find bugs be kind and report them at the Issue Tracker! Thanks!DotNetNuke® Community Edition: 05.06.04: Major Highlights Fixed issue with upgrades on systems that had upgraded the Telerik library to 6.0.0 Fixed issue with Razor Host upgrade to 5.6.3 The logic for module administration checks contains incorrect logic in 1 place, opening the possibility of a user with edit permissions gaining access to functionality they should not have through a particularly crafted url Security FixesBrowsers support the ability to remember common strings such as usernames/addresses etc. Code was adde...Terminals: Version 2.0 - Beta 3 Release: Beta 3 Refresh Dont forget to backup your config files BEFORE upgrading! The team has finally put the nail into the official release date for version 2.0. As bugs are winding down on the 2.0 Roadmap we decided to push out another build - the first 2.0 Beta build. Please take time to use and abuse this release. We left logging in place, and this is a debug build so be sure to submit your logs on each bug reported, and please do report all bugs! Check the source code page on the site, th...iTuner - The iTunes Companion: iTuner 1.4.4322: Added German (unverified, apologies if incorrect) Properly source invariant resources with correct resIDs Replaced obsolete lyric providers with working providers Fix Pseudolater to correctly morph every third char Fix null reference in CatalogBaseNew ProjectsA Blog: This is a blog plus personal web page frameworkAccess 1-D Intersection: This is an Access VBA Module containing functions that allow make it easy to determine overlaps in 1-D intervals. For instance if table A contains a range of 0-7 and Table B contains a range of 5-10, the intersection is 5-7.AkismetPC: A C# implementation of the popular anti-spam plugin Akismet. There aren't many .NET versions of Akismet so I decided to write one and that can be used with .NET blog engines such as Subtext, etc.AlertMonkey: A multicast chat client that enables users to send html, images, sounds, and files to connected users. Provides specialized alert types such as lunch and happy hour, as well as channel support.Azzeton: azzetonBKWork: private project.Blue: Blue is a web application for italian baseball and softball umpires.Build Javascript Models from .Net Classes: Build JavaScript Data Models from .Net Classes automaticallycmpp: cmppCRM 2011 TreeView for Dependent Picklist: This utility will allow CRM Customizer to configure Dependent Picklist items which will be shown as TreeView control on CRM form.DirSign: DirSign is a console exe that evaluates or checks directory signature. DirSign is used to check if something in a directory tree has changed (a file date or a file size or a new or missing file). You can use DirSign in scenario where you need to check if something changed since last time but where you can't install a file system watcher.epictactics: Game for WP7Export SharePoint 2010 External List to Excel: Export SharePoint 2010 external list to Excel with custom ribbon plugin. Export current external list with selected view to office 97 - 2003 or office 2007 - 2010.Floridum: Project for a XML Database.GNU ISO8583: GISO (GNU ISO) is a tool that makes it easier to analyze ISO 8583 financial transactions and also provides a platform to create a host simulator, capable of receiving requests and sending back the responses. It’s a WinForms application and it’s developed using C#.G's Syndication Pocket: G's Syndication Pocket is simple RSS Aggregate application. This is suitable for .NET Compact Framework. I checked it on Sharp's W-ZERO3.Hatena Netfx Library: .NET Library for Hatena Services.inohigo: a programming language that was developed by inohiro.Internet Cache Examiner: Internet Cache Examiner allows Internet Explorer INDEX.DAT files to be read directly, allowing the extraction of more information than is displayed in Internet Explorer, and without being limited to viewing only the activity of the current user. It's developed in C#.Javascript to IQueryable: javascript to IQueryable is an implementation that allows to write a simple query in javascript and then execute it on the server with EntityFramework or a linq provider that implement IQueryable.kisd: Just my code, wanted to keep it safe.LUCA UI for Silverlight 4: LUCA UI is a collection of flexible layout controls for Silverlight 4. Basically, using these controls you can create the same type of user-definable UI that Visual Studio and Expression Blend have.Messenger Game - Starter Kit: Kom godt i gang med at lave spil til Messenger med dette komplette Starter Kit. Indeholder et komplet netværksspil lavet med Messenger Activity API og Silverlight.Music Keys: Music KeysMyNote: MyNoteOpen Source Data System: DataSystem is a file based database system that is thread safe. It is a dynamically generated database meaning developers can either structure it outside the application prior or development. PhotoDesktop: Create background images for your desktop using hundreds of your photos off your local computer. (coming soon - use flickr [or other RSS] feeds)SharePoint Backup Augmentation Cmdlets: The SharePoint Backup Augmentation Cmdlets (SharePointBAC) provide administrators with additional PowerShell cmdlets to complement and extend SharePoint 2010's native backup and restore capabilities. SharePointBAC makes it possible to groom backup sets, archive backups, and more.SharpClassifier: C "Classifier" is an AI software component that tries to classify instances from given evidence (if shiny then diamond). A famous example is classifying email spam, separating it from ham. SharpClassifier currently only contains a single classifier - A Bayesian Naive Classifier. Most Bayesian Naive Classifiers for C# you'll find out there only handles two classes (spam/ham), but this implementation supports any number of classses.Shell Sort Web service and Application: this is a webservice of Sorting methode. use Shell sort methode to sorthing a unsorted number, and it can give a boundary as you input this project is made by Information System students, Ma Chung University , Malang - East Java - Indonesia [url:www.Machung.ac.id] Anna Letizia & SetiawanEka Prayuda Barbiezztissa@gmail.com & setya_09@hotmail.comSistema UELS: adsfasdfSorting Number use Insertion Sort on Web Service: This program can simulate the insertion sort easily.TA_Sorted_App01: First implementation of TA_Sorted Algorithm ThinkDeeper MVC framework: ThinkDeeper MVC is a WPF MVC for .NET 3.5. Typing Game: The Nottingham Game Developer's first game.xBlog: xBlog is a project to build a simple and extensible Blog Engine based on xml and linqXNA DebugDrawer Using Spritebatch: This project serves to show how to draw lines and rectangles using XNA's Spritebatch. This project uses XNA 4.0 and C# programming languageYet another Scedule Planner: YASP - Yet another Scedule Planner

Read the article
DataMining / Analyzing responses to Multiple Choice Questions in a survey

- by Shailesh Tainwala

Hi, I have a set of training data consisting of 20 multiple choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. 50 of these respondents were selected for free product trial. The selection process is not known. What interesting knowledge can be mined from this information? The following is a list of what I have come up with so far- A study of percentages (Example - Percentage of people who answered B on Qs.5 and got selected for free product trial) Conditional probabilities (Example - What is the probability that a person will get selected for free product trial given that he answered B on Qs.5) Naive Bayesian classifier (This can be used to predict whether a person will be selected or not for a given set of values for any subset of questions). Can you think of any other interesting analysis or data-mining activities that can be performed? The usual suspects like correlation can be eliminated as the response is not quantifiable/scoreable. Is my approach correct?

Read the article
Java Spam Filter

- by JackSparrow

I'm trying to create a spam filter in Java using the Bayesian algorithm. I use a text file that contains email messages and split the tokens using regex, storing these values into a hashmap. My problem is, with regex, the email addresses are split so instead of: [email protected] regex causes the token to be: john smith example The same holds true for ip addresses, so for example, instead of: 192.55.34.322 regex splits the tokens to be: 192 55 34 322 So does anybody know of a way that I could read the email messages and store their contents as is?

Read the article
Static Evaluation Function for Checkers

- by Kamikaze

Hi all! I'm trying to write an evaluation function for a game of checkers that I'm developing but I can't find the right documentation. I've read several documents on the web witch describe different techniques for either writing one or letting the computer find it(using genetic algorithms or Bayesian learning) but they're too complicated for a novice like me. All documents pointed a reference to "Some studies in machine learning using the game of checkers" by A.L.Samuel but I couldn't get my hands on it yet :(. I've only read the follow up "Some studies in machine learning using the game of checkers -II" and found some good info there, but it doesn't explain what the eval parameters mean (I think I don't have the whole article). Thanks for your help!

Read the article
Detecting syllables in a word

- by user50705

I need to find a fairly efficient way to detect syllables in a word. E.g., invisible - in-vi-sib-le There are some syllabification rules that could be used: V CV VC CVC CCV CCCV CVCC *where V is a vowel and C is a consonant. e.g., pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC) I've tried few methods, among which were using regex (which helps only if you want to count syllables) or hard coded rule definition (a brute force approach which proves to be very inefficient) and finally using a finite state automata (which did not result with anything useful). The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis. I would appreciate if one could give me tips on an alternate way to solve this problem besides my previous approaches. I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

Read the article
Beginner's resources/introductions to classification algorithms.

- by Dirk

Hi, everybody. I am entirely new to the topic of classification algorithms, and need a few good pointers about where to start some "serious reading". I am right now in the process of finding out, whether machine learning and automated classification algorithms could be a worthwhile thing to add to some application of mine. I already scanned through "How to Solve It: Modern heuristics" by Z. Michalewicz and D. Fogel (in particular, the chapters about linear classifiers using neuronal networks), and on the practical side, I am currently looking through the WEKA toolkit source code. My next (planned) step would be to dive into the realm of Bayesian classification algorithms. Unfortunately, I am lacking a serious theoretical foundation in this area (let alone, having used it in any way as of yet), so any hints at where to look next would be appreciated; in particular, a good introduction of available classification algorithms would be helpful. Being more a craftsman and less a theoretician, the more practical, the better... Hints, anyone?

Read the article
Algorithm to classify a list of products?

- by Martin

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives. Seagate Hard Drive 500Go Seagate Hard Drive 120Go for laptop Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive New and shinny 500Go hard drive from Seagate Seagate Barracuda 7200.12 Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail For a human being, the hard drives 3 and 5 are the same. We could go a little bit further and suppose that the products 1, 3, 4 and 5 are the same and put in other categories the product 2 and 6. We have a huge list of products that I would like to classify. Does anybody have an idea of what would be the best algorithm to do such thing. Any suggestions? I though of a Bayesian classifier but I am not sure if it is the best choice. Any help would be appreciated! Thanks.

Read the article
Simple method for reliably detecting code in text?

- by Jeff Atwood

GMail has this feature where it will warn you if you try to send an email that it thinks might have an attachment. Because GMail detected the string see the attached in the email, but no actual attachment, it warns me with an OK / Cancel dialog when I click the Send button. We have a related problem on Stack Overflow. That is, when a user enters a post like this one: my problem is I need to change the database but I don't won't to create a new connection. example: DataSet dsMasterInfo = new DataSet(); Database db = DatabaseFactory.CreateDatabase("ConnectionString"); DbCommand dbCommand = db.GetStoredProcCommand("uspGetMasterName"); This user did not format their code as code! That is, they didn't indent by 4 spaces per Markdown, or use the code button (or the keyboard shortcut ctrl+k) which does that for them. Thus, our system is accreting a lot of edits where people have to go in and manually format code for people that are somehow unable to figure this out. This leads to a lot of bellyaching. We've improved the editor help several times, but short of driving over to the user's house and pressing the correct buttons on their keyboard for them, we're at a loss to see what to do next. That's why we are considering a Google GMail style warning: Did you mean to post code? You wrote stuff that we think looks like code, but you didn't format it as code by indenting 4 spaces, using the toolbar code button or the ctrl+k code formatting command. However, presenting this warning requires us to detect the presence of what we think is unformatted code in a question. What is a simple, semi-reliable way of doing this? Per Markdown, code is always indented by 4 spaces or within backticks, so anything correctly formatted can be discarded from the check immediately. This is only a warning and it will only apply to low-reputation users asking their first questions (or providing their first answers), so some false positives are OK, so long as they are about 5% or less. Questions on Stack Overflow can be in any language, though we can realistically limit our check to, say, the "big ten" languages. Per the tags page that would be C#, Java, PHP, JavaScript, Objective-C, C, C++, Python, Ruby. Use the Stack Overflow creative commons data dump to audit your potential solution (or just pick a few questions in the top 10 tags on Stack Overflow) and see how it does. Pseudocode is fine, but we use c# if you want to be extra friendly. The simpler the better (so long as it works). KISS! If your solution requires us to attempt to compile posts in 10 different compilers, or an army of people to manually train a bayesian inference engine, that's ... not exactly what we had in mind.

Read the article
Outlook 2007 "Mark as Not Junk" Dialog Confusion

- by David

Outlook 2007's "Not Junk" button opens the "Mark as Not Junk" dialog. The dialog works correctly if I keep the "Always trust e-mail from <email address>" option checked. That is, the message is removed from the Junk folder and returns to the Inbox. However, if I uncheck the "Always trust" box, pressing OK dismisses the dialog, but nothing else happens. Why not? According to Outlook help, "When you mark a message as not junk, you are given the option of adding the sender or the mailing list name to your Safe Senders List or Safe Recipients List." That sure makes it sound like this is just an option, and not necessary for the core functionality of the action. I really don't want to trust a (possibly forged) From: address, but I do want my mail back in the Inbox. I could manually drag it, but I'm assuming that marking a message as not junk also trains some kind of bayesian filter. Am I mistaken? Thanks.

Read the article
Moving automatically spam messages to a folder in Postfix

- by cad

Hi My problem is that I want to automatically to move spam messages to a folder and not sure how. I have a linux box giving email access. MTA is Postfix, IMAP is Courier. As webmail client I use Squirrelmail. To filter SPAM I use Spamassassin and is working ok. Spamassasin is overwriting subjects with [--- SPAM 14.3 ---] Viagra... Also is adding headers: X-Spam-Flag: YES X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on xxxx X-Spam-Level: ************** X-Spam-Status: Yes, score=14.3 required=2.0 tests=BAYES_99, DATE_IN_FUTURE_24_48,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_PBL, RCVD_IN_SORBS_WEB,RCVD_IN_XBL,RDNS_NONE,URIBL_RED,URIBL_SBL autolearn=no version=3.2.5 X-Spam-Report: * 0.0 URIBL_RED Contains an URL listed in the URIBL redlist * [URIs: myimg.de] * 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100% * [score: 1.0000] * 0.9 RCVD_IN_PBL RBL: Received via a relay in Spamhaus PBL * [113.170.131.234 listed in zen.spamhaus.org] * 3.0 RCVD_IN_XBL RBL: Received via a relay in Spamhaus XBL * 0.6 RCVD_IN_SORBS_WEB RBL: SORBS: sender is a abuseable web server * [113.170.131.234 listed in dnsbl.sorbs.net] * 3.2 DATE_IN_FUTURE_24_48 Date: is 24 to 48 hours after Received: date * 0.0 HTML_MESSAGE BODY: HTML included in message * 1.5 MIME_HTML_ONLY BODY: Message only has text/html MIME parts * 1.5 URIBL_SBL Contains an URL listed in the SBL blocklist * [URIs: myimg.de] * 0.1 RDNS_NONE Delivered to trusted network by a host with no rDNS I want to automatically to move spam messages to a folder. Ideally (not sure if possible) only to move messages with puntuation 5.0 or more to folder.. spam between 2.0 and 5.0 I want to be stored in Inbox. (I plan later to switch autolearn on) After reading a lot in procmail, postfix and spamassasin sites and googling a lot (lot of outdated howtos) I found two solutions but not sure which is the best or if there is another one: Put a rule in squirrelmail (dirty solution?) Use Procmail Which is the best option? Do you have any updated howto about it? Thanks

Read the article
If spambots are filling out forms on my website, should I mark the mail as spam?

- by rob

Apparently spambots have discovered the contact forms my website. I'm considering adding a CAPTCHA, but in the meantime I've been getting dozens of completed website forms in my Gmail account each day. I know I can configure a filter to always allow e-mails from my website to be delivered (i.e., "never send to spam"), but I'm wondering whether there are also larger implications. At first, I was marking the spam forms as spam because I figured it would block the sender's e-mail address from getting through again. But since then, I've given it some more thought. I know many spam filters use Bayesian filters and other techniques to identify spam based on its content. If I mark all these e-mails as spam, will legitimate e-mails from my website be marked as spam? Even if I stop marking the messages as spam, I can see another potential problem. My form is pretty standard (in fact, it's probably very similar to other online forms). If I mark my spambot-submitted website forms as spam, will Gmail begin to mark other people's similar-looking messages as spam?

Read the article

Search Results

Search found 61 results on 3 pages for 'bayesian'.

Page 2/3 | < Previous Page | 1 2 3 | Next Page >

- by Silver Dragon

- by BCS

- by Oct

- by LM

- by hiffy

- by apalsson

- by burtonic

- by user95261

- by burtonic

- by wefwgeweg

- by Spines

- by Doug Harris

- by pascal

- by haimg

- by Shailesh Tainwala

- by JackSparrow

- by Kamikaze

- by user50705

- by Dirk

- by Martin

- by Jeff Atwood

- by David

- by cad

- by rob

< Previous Page | 1 2 3 | Next Page >