Search Results

Search found 39 results on 2 pages for 'corpus'.

Page 2/2 | < Previous Page | 1 2

Using Markov models to convert all caps to mixed case and related problems

- by hippietrail

I've been thinking about using Markov techniques to restore missing information to natural language text. Restore mixed case to text in all caps Restore accents / diacritics to languages which should have them but have been converted to plain ASCII Convert rough phonetic transcriptions back into native alphabets That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context. I can use Wiktionary as a dictionary and Wikipedia as a corpus using n-grams and Markov chains to resolve the ambiguities. Am I on the right track? Are there already some services, libraries, or tools for this sort of thing? Examples GEORGE LOST HIS SIM CARD IN THE BUSH - George lost his SIM card in the bush tantot il rit a gorge deployee - tantôt il rit à gorge déployée

Read the article
Given a document select a relevant snippet.

- by BCS

When I ask a question here, the tool tips for the question returned by the auto search given the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a question? My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words from the SO corpus that have very weak correlation with tags, that is that are equally likely to occur in any question regardless of it's tags)

Read the article
Inter-rater agreement (Fleiss' Kappa, Krippendorff's Alpha etc) Java API?

- by adam

I am working on building a Question Classification/Answering corpus as a part of my masters thesis. I'm looking at evaluating my expected answer type taxonomy with respect to inter-rater agreement/reliability, and I was wondering: Does anybody know of any decent (preferably free) Java API(s) that can do this? I'm reasonably certain all I need is Fleiss' Kappa and Krippendorff's Alpha at this point. Weka provides a kappa statistic in it's evaluation package, but I think it can only evaluate a classifier and I'm not at that stage yet (because I'm still building the data set and classes). Thanks.

Read the article
Constructing a tree using Python

- by stealthspy

I am trying to implement a unranked boolean retrieval. For this, I need to construct a tree and perform a DFS to retrieve documents. I have the leaf nodes but I am having difficulty to construct the tree. Eg: query = OR ( AND (maria sharapova) tennis) Result: OR | | AND tennis | | maria sharapova I traverse the tree using DFS and calculate the boolean equivalent of certain document ids to identify the required document from the corpus. Can someone help me with the design of this using python? I have parsed the query and retrieved the leaf nodes for now.

Read the article
Where Next for Google Translate? And What of Information Quality?

- by ultan o'broin

Fascinating article in the UK Guardian newspaper called Can Google break the computer language barrier? In it, Andreas Zollman, who works on Google Translate, comments that the quality of Google Translate's output relative to the amount of data required to create that output is clearly now falling foul of the law of diminishing returns. He says: "Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output," he suggests, but the doublings are not infinite. "We are now at this limit where there isn't that much more data in the world that we can use," he admits. "So now it is much more important again to add on different approaches and rules-based models." The Translation Guy has a further discussion on this, called Google Translate is Finished. He says: "And there aren't that many doublings left, if any. I can't say how much text Google has assimilated into their machine translation databases, but it's been reported that they have scanned about 11% of all printed content ever published. So double that, and double it again, and once more, shoveling all that into the translation hopper, and pretty soon you get the sum of all human knowledge, which means a whopping 1.5% improvement in the quality of the engines when everything has been analyzed. That's what we've got to look forward to, at best, since Google spiders regularly surf the Web, which in its vastness dwarfs all previously published content. So to all intents and purposes, the statistical machine translation tools of Google are done. Outstanding job, Googlers. Thanks." Surprisingly, all this analysis hasn't raised that much comment from the fans of machine translation, or its detractors either for that matter. Perhaps, it's the season of goodwill? What is clear to me, however, of course is that Google Translate isn't really finished (in any sense of the word). I am sure Google will investigate and come up with new rule-based translation models to enhance what they have already and that will also scale effectively where others didn't. So too, will they harness human input, which really is the way to go to train MT in the quality direction. But that aside, what does it say about the quality of the data that is being used for statistical machine translation in the first place? From the Guardian article it's clear that a huge humanly translated corpus drove the gains for Google Translate and now what's left is the dregs of badly translated and poorly created source materials that just can't deliver quality translations. There's a message about information quality there, surely. In the enterprise applications space, where we have some control over content this whole debate reinforces the relationship between information quality at source and translation efficiency, regardless of the technology used to do the translation. But as more automation comes to the fore, that information quality is even more critical if you want anything approaching a scalable solution. This is important for user experience professionals. Issues like user generated content translation, multilingual personalization, and scalable language quality are central to a superior global UX; it's a competitive issue we cannot ignore.

Read the article
Getting started with character and text processing (encoding, regular expressions)

- by TK

I'd like to learn foundations of encodings, characters and text. Understanding these is important for dealing with a large set of text whether that are log files or text source for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay." I don't say I need to learn about advanced topics right away. But I need to know: Bit and bytes level knowledge of encodings. Characters and alphabets not used in English. Multi-byte encodings. (I understand some Chinese and Japanese. And parsing them is important.) Regular expressions. Algorithm for text processing. Parsing natural languages. I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time web) needs processing, parsing and analyzing large text. I'm looking for some resources (maybe books?) that get me started with some of the bullets. (I find many helpful discussion on regular expressions here on Stack Overflow. So, you don't need to suggest resources on that topic.)

Read the article
How do you efficiently implement a document similarity search system?

- by Björn Lindqvist

How do you implement a "similar items" system for items described by a set of tags? In my database, I have three tables, Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article i want to find the five most similar articles to implement a "if you like this article you will like these too" system. I am familiar with Cosine similarity and using that algorithm works very well. But it is way to slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair and then select the five articles with the highest similarity rating. With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity but that can be run in realtime and which does not require me to iterate over the whole document corpus each time. Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at does not enable document similarity searching.

Read the article
Iterating through String word at a time in Python

- by AlgoMan

I have a string buffer of a huge text file. I have to search a given words/phrases in the string buffer. Whats the efficient way to do it ? I tried using re module matches. But As i have a huge text corpus that i have to search through. This is taking large amount of time. Given a Dictionary of words and Phrases. I iterate through the each file, read that into string , search all the words and phrases in the dictionary and increment the count in the dictionary if the keys are found. One small optimization that we thought was to sort the dictionary of phrases/words with the max number of words to lowest. And then compare each word start position from the string buffer and compare the list of words. If one phrase is found, we don search for the other phrases (as it matched the longest phrase ,which is what we want) Can some one suggest how to go about word by word in the string buffer. (Iterate string buffer word by word) ? Also, Is there any other optimization that can be done on this ?

Read the article
Performing a SVD on tweets. Memory problem

- by plotti

I have generated a huge csv file as an output from my pos tagging and stemming. It looks like this: word1, word2, word3, ..., word14400 person1 1 2 0 1 person2 0 0 1 0 ... person650 It contains the word counts for each person. Like this I am getting characteristic vectors for each person. I want to run a SVD on this beast, but it seems the matrix is too big to be held in memory to perform the operation. My quesion is: should i reduce the column size by removing words which have a column sum of for example 1, which means that they have been used only once. Do I bias the data too much with this attempt? I tried the rapidminer attempt, by loading the csv into the db. and then sequentially reading it in with batches for processing, like rapidminer proposes. But Mysql can't store that many columns in a table. If i transpose the data, and then retranspose it on import it also takes ages.... -- So in general I am asking for advice how to perform a svd on such a corpus.

Read the article
Cannot install g++ on ubuntu

- by Erel Segal

I don't have g++: erelsgl@ubuntu:/etc/apt$ which g++ erelsgl@ubuntu:/etc/apt$ erelsgl@ubuntu:/etc/apt$ g++ The program 'g++' can be found in the following packages: * g++ * pentium-builder Try: sudo apt-get install <selected package> So I try to install it: erelsgl@ubuntu:~/srilm$ sudo apt-get install g++ Reading package lists... Done Building dependency tree Reading state information... Done g++ is already the newest version. 0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded. 2 not fully installed or removed. After this operation, 0B of additional disk space will be used. Setting up g++ (4:4.4.3-1ubuntu1) ... update-alternatives: error: alternative path /usr/bin/g++ doesn't exist. dpkg: error processing g++ (--configure): subprocess installed post-installation script returned error exit status 2 dpkg: dependency problems prevent configuration of build-essential: build-essential depends on g++ (>= 4:4.3.1); however: Package g++ is not configured yet. dpkg: error processing build-essential (--configure): dependency problems - leaving unconfigured No apport report written because the error message indicates its a followup error from a previous failure. Errors were encountered while processing: g++ build-essential E: Sub-process /usr/bin/dpkg returned an error code (1) I also try to install build-essential, and get same results. I also tried "sudo apt-get update" - didn't help. This is my apt-cache: erelsgl@ubuntu:/etc/apt$ apt-cache policy g++ build-essential g++: Installed: 4:4.4.3-1ubuntu1 Candidate: 4:4.4.3-1ubuntu1 Version table: *** 4:4.4.3-1ubuntu1 0 500 http://il.archive.ubuntu.com/ubuntu/ lucid/main Packages 100 /var/lib/dpkg/status build-essential: Installed: 11.4build1 Candidate: 11.4build1 Version table: *** 11.4build1 0 500 http://il.archive.ubuntu.com/ubuntu/ lucid/main Packages 100 /var/lib/dpkg/status erelsgl@ubuntu:/etc/apt$ I also tried this and got the same error: erelsgl@ubuntu:~/Ace/Files/corpus$ sudo dpkg --configure -a Setting up g++ (4:4.4.3-1ubuntu1) ... update-alternatives: error: alternative path /usr/bin/g++ doesn't exist. dpkg: error processing g++ (--configure): subprocess installed post-installation script returned error exit status 2 dpkg: dependency problems prevent configuration of build-essential: build-essential depends on g++ (>= 4:4.3.1); however: Package g++ is not configured yet. dpkg: error processing build-essential (--configure): dependency problems - leaving unconfigured Errors were encountered while processing: g++ build-essential

Read the article
Sentiment analysis with NLTK python for sentences using sample data or webservice?

- by Ke

I am embarking upon a NLP project for sentiment analysis. I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice) I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??) Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron" Then I would like to check for positive/negative sentiment in each sentence and count them accordingly NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm. Here are the troubles I am having: All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?) I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary? I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link? Any help is much appreciated and will save me much hair! Cheers Ke

Read the article
Using a "white list" for extracting terms for Text Mining

- by [email protected]

In Part 1 of my post on "Generating cluster names from a document clustering model" (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In this process we specified a custom stoplist and lexer and relied on Oracle Text to identify important terms. However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms. INTRODUCTIONA stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like a, if, and, or, and but normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word Oracle is ubiquitous in the Oracle product literature. One problem with stoplists is determining which words to specify. This usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the documents to determine if you missed any. Since a corpus of documents could contain thousands of words, this could be a tedious exercise. Moreover, since every word is considered as an individual token, a term excluded in one context may be needed to help identify a term in another context. For example, in our Oracle product literature example, the words "Oracle Data Mining" taken individually are not particular helpful. The term "Oracle" may be found in nearly all documents, as with the term "Data." The term "Mining" is more unique, but could also refer to the Mining industry. If we exclude "Oracle" and "Data" by specifying them in the stoplist, we lose valuable information. But it we include them, they may introduce too much noise. Still, when you have a broad vocabulary or don't have a list of specific terms of interest, you rely on the text engine to identify important terms, often by computing the term frequency - inverse document frequency metric. (This is effectively a weight associated with each term indicating its relative importance in a document within a collection of documents. We'll revisit this later.) The results using this technique is often quite valuable. As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list, or a list of terms--perhaps multi-word--that we want to extract and use for data mining. The obvious downside to this approach is the need to specify the set of terms of interest. However, this may not be as daunting a task as it seems. For example, in a given domain (Oracle product literature), there is often a recognized glossary, or a list of keywords and phrases (Oracle product names, industry names, product categories, etc.). Being able to identify multi-word terms, e.g., "Oracle Data Mining" or "Customer Relationship Management" as a single token can greatly increase the quality of the data mining results. The remainder of this post and subsequent posts will focus on how to produce a dataset that contains white list terms, suitable for mining. CREATING A WHITE LIST We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules that are in effect our mapping from single and multi-word terms to the tokens used to represent those terms. For example, "Oracle Data Mining" becomes "ORACLEDATAMINING." First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case and values in the TERM column are intended to be mapped to the token in the TOKEN column. TERM TOKEN DATA MINING DATAMINING ORACLE DATA MINING ORACLEDATAMINING 11G ORACLE11G JAVA JAVA CRM CRM CUSTOMER RELATIONSHIP MANAGEMENT CRM ... Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules: CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE); CREATE TABLE my_thesaurus_rules (main_term VARCHAR2(100), query_string VARCHAR2(400)); We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term, and the corresponding query string, which specifies synonyms for the token in the thesaurus. DECLARE cursor c2 is select token, term from my_term_token_map; BEGIN for r_c2 in c2 loop CTX_THES.CREATE_RELATION('my_thesaurus',r_c2.token,'SYN',r_c2.term); EXECUTE IMMEDIATE 'insert into my_thesaurus_rules values (:1,''SYN(' || r_c2.token || ', my_thesaurus)'')' using r_c2.token; end loop; END; We are effectively inserting the token to return and the corresponding query that will look up synonyms in our thesaurus into the my_thesaurus_rules table, for example: 'ORACLEDATAMINING' SYN ('ORACLEDATAMINING', my_thesaurus)At this point, we create a CTXRULE index on the my_thesaurus_rules table: create index my_thesaurus_rules_idx on my_thesaurus_rules(query_string) indextype is ctxsys.ctxrule; In my next post, this index will be used to extract the tokens that match each of the rules specified. We'll then compute the tf-idf weights for each of the terms and create a nested table suitable for mining.

Read the article
CodePlex Daily Summary for Monday, September 10, 2012

CodePlex Daily Summary for Monday, September 10, 2012Popular ReleasesJson Services: Json Services 0.3.2: Json Services 0.3.2 released This release includes: JavaScript support Android support Flex SupportServiceMon - Extensible Real-time, Service Monitoring Utility: ServiceMon Release 0.9.0.50: Auto-uploaded from build serverSocial Network Importer for NodeXL: SocialNetImporter(v.1.5): This new version includes: - Fixed the "resource limit" bug caused by Facebook - Bug fixes To use the new graph data provider, do the following: Unzip the Zip file into the "PlugIns" folder that can be found in the NodeXL installation folder (i.e "C:\Program Files\Social Media Research Foundation\NodeXL Excel Template\PlugIns") Open NodeXL template and you can access the new importer from the "Import" menuAcDown????? - AcDown Downloader Framework: AcDown????? v4.1: ??●AcDown??????????、??、??、???????。????，????，?????????????????????????。???????????Acfun、????(Bilibili)、??、??、YouTube、??、???、??????、SF????、????????????。 ●??????AcPlay?????，??????、????????????????。 ● AcDown??????????????????，????????????????????????????。 ● AcDown???????C#??，????.NET Framework 2.0??。?????"Acfun?????"。 ????32??64? Windows XP/Vista/7/8 ???? 32??64? ???Linux ????(1)????????Windows XP???,?????????.NET Framework 2.0???(x86)，?????"?????????"??? (2)???????????Linux???，????????Mono?? ?...Move Mouse: Move Mouse 2.5.2: FIXED - Minor fixes and improvements.MVC Controls Toolkit: Mvc Controls Toolkit 2.3: Added The new release is compatible with Mvc4 RTM. Support for handling Time Zones in dates. Specifically added helper methods to convert to UTC or local time all DateTimes contained in a model received by a controller, and helper methods to handle date only fileds. This together with a detailed documentation on how TimeZones are handled in all situations by the Asp.net Mvc framework, will contribute to mitigate the nightmare of dates and timezones. Multiple Templates, and more options to...DNN Metro7 style Skin package: Metro7 06.02.00: Maintenance Release Changes on Metro7 06.02.00 Fixed width and height on the jQuery popup for the Editor. Navigation Provider changed to DDR menu Added menu files and scripts Changed skins to Doctype HTML Changed manifest to dnn6 manifest file Changed License to HTML view Fixed issue on Metro7/PinkTitle.ascx with double registering of the Actions Changed source folder structure and start folder, so the project works with the default DNN structure on developing Added VS 20...Mishra Reader: Mishra Reader Beta 4: Additional bug fixes and logging in this release to try to find the reason some users aren't able to see the main window pop up. Also, a few UI tweaks to tighten up the feed item list. This release requires the final version of .NET 4.5. If the ClickOnce installer doesn't work for you, please try the additional setup exe.Xenta Framework - extensible enterprise n-tier application framework: Xenta Framework 1.9.0: Release Notes Imporved framework architecture Improved the framework security More import/export formats and operations New WebPortal application which includes forum, new, blog, catalog, etc. UIs Improved WebAdmin app. Reports, navigation and search Perfomance optimization Improve Xenta.Catalog domain More plugin interfaces and plugin implementations Refactoring Windows Azure support and much more... Package Guide Source Code - package contains the source code Binaries...Json.NET: Json.NET 4.5 Release 9: New feature - Added JsonValueConverter New feature - Set a property's DefaultValueHandling to Ignore when EmitDefaultValue from DataMemberAttribute is false Fix - Fixed DefaultValueHandling.Ignore not igoring default values of non-nullable properties Fix - Fixed DefaultValueHandling.Populate error with non-nullable properties Fix - Fixed error when writing JSON for a JProperty with no value Fix - Fixed error when calling ToList on empty JObjects and JArrays Fix - Fixed losing deci...Microsoft Ajax Minifier: Microsoft Ajax Minifier 4.66: Just going to bite the bullet and rip off the band-aid... SEMI-BREAKING CHANGE! Well, it's a BREAKING change to those who already adjusted their projects to use the previous breaking change's ill-conceived renamed DLLs (versions 4.61-4.65). For those who had not adapted and were still stuck in this-doesn't-work-please-fix-me mode, this is more like a fixing change. The previous breaking change just broke too many people, I'm sorry to say. Renaming the DLL from AjaxMin.dll to AjaxMinLibrary.dl...DotNetNuke® Community Edition CMS: 07.00.00 CTP (Not for Production Use): NOTE: New Minimum Requirementshttp://www.dotnetnuke.com/Portals/25/Blog/Files/1/3418/Windows-Live-Writer-1426fd8a58ef_902C-MinimumVersionSupport_2.png Simplified InstallerThe first thing you will notice is that the installer has been updated. Not only have we updated the look and feel, but we also simplified the overall install process. You shouldn’t have to click through a series of screens in order to just get your website running. With the 7.0 installer we have taken an approach that a...BIDS Helper: BIDS Helper 1.6.1: In addition to fixing a number of bugs that beta testers reported, this release includes the following new features for Tabular models in SQL 2012: New Features: Tabular Display Folders Tabular Translations Editor Tabular Sync Descriptions Fixed Issues: Biml issues 32849 fixing bug in Tabular Actions Editor Form where you type in an invalid action name which is a reserved word like CON or which is a duplicate name to another action 32695 - fixing bug in SSAS Sync Descriptions whe...Umbraco CMS: Umbraco 4.9.0: Whats newThe media section has been overhauled to support HTML5 uploads, just drag and drop files in, even multiple files are supported on any HTML5 capable browser. The folder content overview is also much improved allowing you to filter it and perform common actions on your media items. The Rich Text Editor’s “Media” button now uses an embedder based on the open oEmbed standard (if you’re upgrading, enable the media button in the Rich Text Editor datatype settings and set TidyEditorConten...WordMat: WordMat v. 1.02: This version was used for the 2012 exam.menu4web: menu4web 0.4.1 - javascript menu for web sites: This release is for those who believe that global variables are evil. menu4web has been wrapped into m4w singleton object. Added "Vertical Tabs" example which illustrates object notation.Microsoft SQL Server Product Samples: Database: AdventureWorks OData Feed: The AdventureWorks OData service exposes resources based on specific SQL views. The SQL views are a limited subset of the AdventureWorks database that results in several consuming scenarios: CompanySales Documents ManufacturingInstructions ProductCatalog TerritorySalesDrilldown WorkOrderRouting How to install the sample You can consume the AdventureWorks OData feed from http://services.odata.org/AdventureWorksV3/AdventureWorks.svc. You can also consume the AdventureWorks OData fe...Desktop Google Reader: 1.4.6: Sorting feeds alphabetical is now optional (see preferences window)Droid Explorer: Droid Explorer 0.8.8.7 Beta: Bug in the display icon for apk's, will fix with next release Added fallback icon if unable to get the image/icon from the Cloud Service Removed some stale plugins that were either out dated or incomplete. Added handler for *.ab files for restoring backups Added plugin to create device backups Backups stored in %USERPROFILE%\Android Backups\%DEVICE_ID%\ Added custom folder icon for the android backups directory better error handling for installing an apk bug fixes for the Runn...The Visual Guide for Building Team Foundation Server 2012 Environments: Version 1: --New Projectsalimujtaba_project: bhssASP.NET Databinding Tutorial: This is a tutorial project on asp.net databinding.asp.net msaccess tutorial: This is a tutorial projectaspnet gridview: This is a tutorial projectaspnet login tutorial: this is a tutorial projectDataglot - A helper library for System.Data.Common: To support more than one database using System.Data.Common you need connection string conversion, light sql reparsing, multi-cast, trace and many more utilitiesDnsMigrate: This is a small powershell script to move a standalone Microsoft Dns Server to another machine. Also you can backup existing dns records with this script.Elements: Elements is an implementation of the Composite Application Guide for WPF. Written in C# using Visual Studio 2012.EQBAZ: EQ Baz pullFujiy Minesweeper: A MSN Messenger Minesweeper clone. You can play Minesweeper with a friend, who finds more bombs winsHobbyLister: The aim of this project is to create a organizetool for Movies, Books, Boardgame etc.hotfighter2: a gameJihanki: jihankimasterpage tutorial: This a tutorial projectMongo Explorer Tool: Mongo Explorer ToolMSMQ Managment Console: MSMQ Management Console is a command line tool that allows to run variety of commands on local and remote queues, with export and import capabilities.NTCPMSG: High performance open source TCP message send and receive component which written by C#.net.PlaySly: Aplicación para Windows 8 estilo Modern UI para usar series.ly. Está orientada al visionado online, críticas y votaciones. En proceso de construcción.RhotCMS: This is my asp.net MVC 4 CMS.SBQP: Trabalho de Conclusão de Curso de Graduação apresentado a Universidade Estácio de SáSimplyMessenger: SimplyMessenger is a light Network Messenger. It use C# and WPF.toki pona - corpus linguistic tools and various experiments: Various mini-linguistics tools targeting toki pona, a very small fake language. Of potential interest to hobby linguists and conlang enthusiasts.Turbo Server: An FTP Server for Android

Read the article
Java: If vs. Switch

- by _ande_turner_

I have a piece of code with a) which I replaced with b) purely for legibility ... a) if ( WORD[ INDEX ] == 'A' ) branch = BRANCH.A; /* B through to Y */ if ( WORD[ INDEX ] == 'Z' ) branch = BRANCH.Z; b) switch ( WORD[ INDEX ] ) { case 'A' : branch = BRANCH.A; break; /* B through to Y */ case 'Z' : branch = BRANCH.Z; break; } ... will the switch version cascade through all the permutations or jump to a case ? EDIT: Some of the answers below regard alternative approaches to the approach above. I have included the following to provide context for its use. The reason I asked, the Question above, was because the speed of adding words empirically improved. This isn't production code by any means, and was hacked together quickly as a PoC. The following seems to be a confirmation of failure for a thought experiment. I may need a much bigger corpus of words than the one I am currently using though. The failure arises from the fact I did not account for the null references still requiring memory. ( doh ! ) public class Dictionary { private static Dictionary ROOT; private boolean terminus; private Dictionary A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z; private static Dictionary instantiate( final Dictionary DICTIONARY ) { return ( DICTIONARY == null ) ? new Dictionary() : DICTIONARY; } private Dictionary() { this.terminus = false; this.A = this.B = this.C = this.D = this.E = this.F = this.G = this.H = this.I = this.J = this.K = this.L = this.M = this.N = this.O = this.P = this.Q = this.R = this.S = this.T = this.U = this.V = this.W = this.X = this.Y = this.Z = null; } public static void add( final String...STRINGS ) { Dictionary.ROOT = Dictionary.instantiate( Dictionary.ROOT ); for ( final String STRING : STRINGS ) Dictionary.add( STRING.toUpperCase().toCharArray(), Dictionary.ROOT , 0, STRING.length() - 1 ); } private static void add( final char[] WORD, final Dictionary BRANCH, final int INDEX, final int INDEX_LIMIT ) { Dictionary branch = null; switch ( WORD[ INDEX ] ) { case 'A' : branch = BRANCH.A = Dictionary.instantiate( BRANCH.A ); break; case 'B' : branch = BRANCH.B = Dictionary.instantiate( BRANCH.B ); break; case 'C' : branch = BRANCH.C = Dictionary.instantiate( BRANCH.C ); break; case 'D' : branch = BRANCH.D = Dictionary.instantiate( BRANCH.D ); break; case 'E' : branch = BRANCH.E = Dictionary.instantiate( BRANCH.E ); break; case 'F' : branch = BRANCH.F = Dictionary.instantiate( BRANCH.F ); break; case 'G' : branch = BRANCH.G = Dictionary.instantiate( BRANCH.G ); break; case 'H' : branch = BRANCH.H = Dictionary.instantiate( BRANCH.H ); break; case 'I' : branch = BRANCH.I = Dictionary.instantiate( BRANCH.I ); break; case 'J' : branch = BRANCH.J = Dictionary.instantiate( BRANCH.J ); break; case 'K' : branch = BRANCH.K = Dictionary.instantiate( BRANCH.K ); break; case 'L' : branch = BRANCH.L = Dictionary.instantiate( BRANCH.L ); break; case 'M' : branch = BRANCH.M = Dictionary.instantiate( BRANCH.M ); break; case 'N' : branch = BRANCH.N = Dictionary.instantiate( BRANCH.N ); break; case 'O' : branch = BRANCH.O = Dictionary.instantiate( BRANCH.O ); break; case 'P' : branch = BRANCH.P = Dictionary.instantiate( BRANCH.P ); break; case 'Q' : branch = BRANCH.Q = Dictionary.instantiate( BRANCH.Q ); break; case 'R' : branch = BRANCH.R = Dictionary.instantiate( BRANCH.R ); break; case 'S' : branch = BRANCH.S = Dictionary.instantiate( BRANCH.S ); break; case 'T' : branch = BRANCH.T = Dictionary.instantiate( BRANCH.T ); break; case 'U' : branch = BRANCH.U = Dictionary.instantiate( BRANCH.U ); break; case 'V' : branch = BRANCH.V = Dictionary.instantiate( BRANCH.V ); break; case 'W' : branch = BRANCH.W = Dictionary.instantiate( BRANCH.W ); break; case 'X' : branch = BRANCH.X = Dictionary.instantiate( BRANCH.X ); break; case 'Y' : branch = BRANCH.Y = Dictionary.instantiate( BRANCH.Y ); break; case 'Z' : branch = BRANCH.Z = Dictionary.instantiate( BRANCH.Z ); break; } if ( INDEX == INDEX_LIMIT ) branch.terminus = true; else Dictionary.add( WORD, branch, INDEX + 1, INDEX_LIMIT ); } public static boolean is( final String STRING ) { Dictionary.ROOT = Dictionary.instantiate( Dictionary.ROOT ); return Dictionary.is( STRING.toUpperCase().toCharArray(), Dictionary.ROOT, 0, STRING.length() - 1 ); } private static boolean is( final char[] WORD, final Dictionary BRANCH, final int INDEX, final int INDEX_LIMIT ) { Dictionary branch = null; switch ( WORD[ INDEX ] ) { case 'A' : branch = BRANCH.A; break; case 'B' : branch = BRANCH.B; break; case 'C' : branch = BRANCH.C; break; case 'D' : branch = BRANCH.D; break; case 'E' : branch = BRANCH.E; break; case 'F' : branch = BRANCH.F; break; case 'G' : branch = BRANCH.G; break; case 'H' : branch = BRANCH.H; break; case 'I' : branch = BRANCH.I; break; case 'J' : branch = BRANCH.J; break; case 'K' : branch = BRANCH.K; break; case 'L' : branch = BRANCH.L; break; case 'M' : branch = BRANCH.M; break; case 'N' : branch = BRANCH.N; break; case 'O' : branch = BRANCH.O; break; case 'P' : branch = BRANCH.P; break; case 'Q' : branch = BRANCH.Q; break; case 'R' : branch = BRANCH.R; break; case 'S' : branch = BRANCH.S; break; case 'T' : branch = BRANCH.T; break; case 'U' : branch = BRANCH.U; break; case 'V' : branch = BRANCH.V; break; case 'W' : branch = BRANCH.W; break; case 'X' : branch = BRANCH.X; break; case 'Y' : branch = BRANCH.Y; break; case 'Z' : branch = BRANCH.Z; break; } if ( branch == null ) return false; if ( INDEX == INDEX_LIMIT ) return branch.terminus; else return Dictionary.is( WORD, branch, INDEX + 1, INDEX_LIMIT ); } }

Read the article

Search Results

Search found 39 results on 2 pages for 'corpus'.

Page 2/2 | < Previous Page | 1 2

- by hippietrail

- by BCS

- by adam

- by stealthspy

- by ultan o'broin

- by TK

- by Björn Lindqvist

- by AlgoMan

- by plotti

- by Erel Segal

- by Ke

- by [email protected]

- by _ande_turner_

< Previous Page | 1 2