Search Results

Search found 28 results on 2 pages for 'stemming'.

Page 1/2 | 1 2 | Next Page >

another porter stemming algorithm implementation question ?

- by mike

Hi, I am trying to implement porter stemming algorithm, but i am having difficualties understanding this point Step 1c (*v*) Y -> I happy -> happi sky -> sky Isn't that the the opposite of what we want to do , why does the algorithim convert the Y into I. for the complete algorithm here http://tartarus.org/~martin/PorterStemmer/def.txt Thanks

Read the article
Google GSA Stems for scandinavian languages

- by HAXEN

I have installed Scandinavia-2.1-1 language bundle to our GSA. After that I expected to find those languages available in Query Expansion, but nope nothing new there. Am I missing something? How are you other Scandinavians handling stems for your language?

Read the article
Something confusing about FormsOf (Sql server Full-Text searching)

- by AspOnMyNet

hi I'm using Sql Server 2008 1) A given <simple_term> within a <generation_term> will not match both nouns and verbs. If I understand the above text correctly, then query SELECT * FROM someTable WHERE CONTAINS ( *, ' FORMSOF ( INFLECTIONAL, park ) ' ) should search for either nouns or verbs derived from the root word “park”, but not for both? Thus out of the two rows, one containing noun parks and other verb parking, the above query should return just one of the two rows? But as it turns out, query returns both rows, so are perhaps my assumptions a bit off or is the above quote wrong?! 2) From Msdn: If freetext_string is enclosed in double quotation marks, a phrase match is instead performed; stemming and thesaurus are not performed. According to the above quote the following query shouldn’t return rows containing strings surfing ( due to query not performing stemming ), surf ( due to query performing phrase matching and not individual word matching ) and surfing with suzy’s sister ( due to query not performing stemming and due to query performing phrase matching and not word matching ), but it does. Thus, it appears that even when *freetext_string* is enclosed in double quotation marks, stemming is still preformed, while phrase matching is not: SELECT * FROM someTable WHERE FREETEXT( *, ' "surf sister" ' ) So is the above quote wrong or...? thanx

Read the article
Book Review - Programming Windows Azure by Siriram Krishnan

- by BuckWoody

As part of my professional development, I’ve created a list of books to read throughout the year, starting in June of 2011. This a review of the first one, called Programming Windows Azure by Siriram Krishnan. You can find my entire list of books I’m reading for my career here: http://blogs.msdn.com/b/buckwoody/archive/2011/06/07/head-in-the-clouds-eyes-on-the-books.aspx Why I Chose This Book: As part of my learning style, I try to read multiple books about a single subject. I’ve found that at least 3 books are necessary to get the right amount of information to me. This is a “technical” work, meaning that it deals with technology and not business, writing or other facets of my career. I’ll have a mix of all of those as I read along. I chose this work in addition to others I’ve read since it covers everything from an introduction to more advanced topics in a single book. It also has some practical examples of actually working with the product, particularly on storage. Although it’s dated, many examples normally translate. I also saw that it had pretty good reviews. What I learned: I learned a great deal about storage, and many useful code snippets. I do think that there could have been more of a focus on the application fabric - but of course that wasn’t as mature a feature when this book was written. I learned some great architecture examples, and in one section I learned more about encryption. In that example, however, I would rather have seen the examples go the other way - the book focused on moving data from on-premise to Azure storage in an encrypted fashion. Using the Application Fabric I would rather see sensitive data left in a hybrid fashion on premise, and connect to for the Azure application. Even so, the examples were very useful. If you’re looking for a good “starter” Azure book, this is a good choice. I also recommend the last chapter as a quick read for a DBA, or Database Administrator. It’s not very long, but useful. Note that the limits described are incorrect - which is one of the dangers of reading a book about any cloud offering. The services offered are updated so quickly that the information is in constant danger of being “stale”. Even so, I found this a useful book, which I believe will help me work with Azure better. Raw Notes: I take notes as I read, calling that process “reading with a pencil”. I find that when I do that I pay attention better, and record some things that I need to know later. I’ll take these notes, categorize them into a OneNote notebook that I synchronize in my Live.com account, and that way I can search them from anywhere. I can even read them on the web, since the Live.com has a OneNote program built in. Note that these are the raw notes, so they might not make a lot of sense out of context - I include them here so you can watch my though process. Programming Windows Azure by Siriram Krishnan: Learning about how to select applications suitable for Distributed Technology. Application Fabric gets the least attention; probably because it was newer at the time. Very clear (Chapter One) Good foundation Background and history, but not too much I normally arrange my descriptions differently, starting with the use-cases and moving to physicality, but this difference helps me. Interesting that I am reading this using Safari Books Online, which uses many of these concepts. Taught me some new aspects of a Hypervisor – very low-level information about the Azure Fabric (not to be confused with the Application Fabric feature) (Chapter Two) Good detail of what is included in the SDK. Even more is available now. CS = Cloud Service (Chapter 3) Place Storage info in the configuration file, since it can be streamed in-line with a running app. Ditto for logging, and keep separated configs for staging and testing. Easy-switch in and switch out. (Chapter 4) There are two Runtime API’s, one of external and one for internal. Realizing how powerful this paradigm really is. Some places seem light, and to drop off but perhaps that’s best. Managing API is not charged, which is nice. I don’t often think about the price, until it comes to an actual deployment (Chapter 5) Csmanage is something I want to dig into deeper. API requires package moves to Blob storage first, so it needs a URL. Csmanage equivalent can be written in Unix scripting using openssl. Upgrades are possible, and you use the upgradeDomainCount attribute in the Service-Definition.csdef file Always use a low-privileged account to test on the dev fabric, since Windows Azure runs in partial trust. Full trust is available, but can be dangerous and must be well-thought out. (Chapter 6) Learned how to run full CMD commands in a web window – not that you would ever do that, but it was an interesting view into those links. This leads to a discussion on hosting other runtimes (such as Java or PHP) in Windows Azure. I got an expanded view on this process, although this is where the book shows its age a little. Books can be a problem for Cloud Computing for this reason – things just change too quickly. Windows Azure storage is not eventually consistent – it is instantly consistent with multi-phase commit. Plumbing for this is internal, not required to code that. (Chapter 7) REST API makes the service interoperable, hybrid, and consistent across code architectures. Nicely done. Use affinity groups to keep data and code together. Side note: e-book readers need a common “notes” feature. There’s a decent quick description of REST in this chapter. Learned about CloudDrive code – PowerShell sample that mounts Blob storage as a local provider. Works against Dev fabric by default, can be switched to Account. Good treatment in the storage chapters on the differences between using Dev storage and Azure storage. These can be mitigated. No, blobs are not of any size or number. Not a good statement (Chapter 8) Blob storage is probably Azure’s closest play to Infrastructure as a Service (Iaas). Blob change operations must be authenticated, even when public. Chapters on storage are pretty in-depth. Queue Messages are base-64 encoded (Chapter 9) The visibility timeout ensures processing of message in a disconnected system. Order is not guaranteed for a message, so if you need that set an increasing number in the queue mechanism. While Queues are accessible via REST, they are not public and are secured by default. Interesting – the header for a queue request includes an estimated count. This can be useful to create more worker roles in a dynamic system. Each Entity (row) in the Azure Table service is atomic – all or nothing. (Chapter 10) An entity can have up to 255 Properties Use “ID” for the class to indicate the key value, or use the [DataServiceKey] Attribute. LINQ makes working with the Azure Table Service much easier, although Interop is certainly possible. Good description on the process of selecting the Partition and Row Key. When checking for continuation tokens for pagination, include logic that falls out of the check in case you are at the last page. On deleting a storage object, it is instantly unavailable, however a background process is dispatched to perform the physical deletion. So if you want to re-create a storage object with the same name, add retry logic into the code. Interesting approach to deleting an index entity without having to read it first – create a local entity with the same keys and apply it to the Azure system regardless of change-state. Although the “Indexes” description is a little vague, it’s interesting to see a Folding and Stemming discussion a-la the Porter Stemming Algorithm. (Chapter 11) Presents a better discussion of indexes (at least inverted indexes) later in the chapter. Great treatment for DBA’s in Chapter 11. We need to work on getting secondary indexes in Table storage. There is a limited form of transactions called “Entity Group Transactions” that, although they have conditions, makes a transactional system more possible. Concurrency also becomes an issue, but is handled well if you’re using Data Services in .NET. It watches the Etag and allows you to take action appropriately. I do not recommend using Azure as a location for secure backups. In fact, I would rather have seen the examples in (Chapter 12) go the other way, showing how data could be brought back to a local store as a DR or HA strategy. Good information on cryptography and so on even so. Chapter seems out of place, and should be combined with the Blob chapter. (Chapter 13) on SQL Azure is dated, although the base concepts are OK. Nice example of simple ADO.NET access to a SQL Azure (or any SQL Server Really) database.

Read the article
Cannot find GRUB - Ubuntu/Windows 8 dual-boot

- by ubeatlenine

Hello Ubuntu community, I find myself in an interesting situation. I have a Dell Inspiron 531 with Windows Vista. Recently my brother decided it would be a good idea to overwrite Vista with the Windows 8 consumer preview. Since we have had this PC for a very long time, we have long since lost the Vista CD, and according to the Windows 8 preview website you cannot recover your previous OS without it. I thought this would be a good opportunity to try out Ubuntu (since we obviously cannot keep the preview as an OS), but it appears that Ubuntu 11.10 Desktop is not compatible with Win8. Ubuntu doesn't run from the LiveUSB I made, instead it freezes on the loading screen and then disintegrates into black and white stripes. I blamed this failure on Ubuntu not being compatible with win8 yet and tried to install Ubuntu from the USB on a partition made from the remaining space on my hard drive - about 100GB. However the installer crashed while loading modules and told me I didn't have enough disk space. Since then, I have not been able to load either Ubuntu or Windows, BIOS is shifted over to the left of my screen, and I always get the same message: error: unknown filesystem grub rescue> typing "ls" at the prompt gives me the following: (hd0) (hd0,msdos7) (hd0,msdos6) (hd0,msdos5) (hd0,msdos2) (hd0,msdos1) does this mean I have multiple partitions running windows on my computer? Is it possible to recover Vista without the disk? Are all of my problems stemming from Ubuntu not being compatible with Win8 preview? (I realize the majority of my questions are about Windows, but seeing as the prompt I get is for grub I thought I would ask here first.) Any insight anyone has on this predicament would be greatly appreciated.

Read the article
logback with EJB3.1

- by kgrad

I am using logback/slf4j to handle logging in my application. Everything was working perfectly until I started using EJBs. Once I added a stateless EJB to my app, the logger started ignoring my logback.xml and stopped using my appenders. I switched to a programmatic logger configuration to see what was wrong and now I am getting the following error when I try to use my logger within the EJB: org.slf4j.impl.JDK14LoggerFactory cannot be cast to ch.qos.logback.classic.LoggerContext stemming from the line: LoggerContext lc = (LoggerContext) LoggerFactory.getILoggerFactory(); Is there any special configuration necessary to get logback to work with EJBs? If it matters I am deploying on glassfish v3.

Read the article
Proper snowball analyzer configuration when using Grails Searchable Plugin

- by Wirsbro

To improve stemming we want to switch from the default analyzer to snowball, however, having a lot of difficulty with the proper settings and would appreciate any help. In Environment: - Sun's Java 1.6.16 - Grails 1.2.2 - Searchable Plug-In 0.5.5 Config.groovy: Have tried both settings: compassSettings = ['compass.engine.analyzer.stemmed.type': 'snowball', 'compass.engine.analyzer.stemmed.name': 'English'] compassSettings = ['compass.engine.analyzer.snowball.type': 'snowball', 'compass.engine.analyzer.snowball.name': 'English', 'compass.engine.analyzer.search.type': 'snowball', 'compass.engine.analyzer.search.name': 'English'] Search.groovy - The Invocation: def searchResult = searchableService.search(params.q, withHighlighter: { highlighter, index, sr if (!sr.highlights) { sr.highlights = [] } try { sr.highlights[index] = highlighter.fragments("content")[0..2].join(" ") } catch (IndexOutOfBoundsException ex) { sr.highlights[index] = highlighter.fragment("content") } }) def suggestion = searchableService.suggestQuery(params.q) if (suggestion != params.q) { searchResult.suggestedQuery = suggestion }

Read the article
Address Book is returning old values

- by Marcus Wood

Hi there, I am having a problem with the AddressBook framework. It all seems to be stemming from ABCopyRecordForUniqueId returning a record with old data. Example: I run up the program below in one terminal window - it shows the current data. I make a change through the address book UI - my program continues to show old data. I run up another instance of the same program in a new terminal window - it shows the updated data. I have tried posting on the omnigroup site with no luck :( so any guidance is really appreciated PS: If you would like to try the code, to get an address book ID you can export a contact as a vCard and open it with a text editor int main (int argc, const char * argv[]) { ABAddressBookRef addressBook = ABGetSharedAddressBook(); while(1) { ABRecordRef addressBookRecord = NULL; addressBookRecord = ABCopyRecordForUniqueId(addressBook, CFSTR("4064D587-0378-4DCF-A6B9-D3702F01C94C:ABPerson")); CFShow(addressBookRecord); CFRelease(addressBookRecord); sleep(1); } return 0; }

Read the article
Use C function in C++ program; "multiply-defined" error

- by eom

I am trying to use this code for the Porter stemming algorithm in a C++ program I've already written. I followed the instructions near the end of the file for using the code as a separate module. I created a file, stem.c, that ends after the definition and has extern int stem(char * p, int i, int j) ... It worked fine in Xcode but it does not work for me on Unix with gcc 4.1.1--strange because usually I have no problem moving between the two. I get the error ld: fatal: symbol `stem(char*, int, int)' is multiply-defined: (file /var/tmp//ccrWWlnb.o type=FUNC; file /var/tmp//cc6rUXka.o type=FUNC); ld: fatal: File processing errors. No output written to cluster I've looked online and it seems like there are many things I could have wrong, but I'm not sure what combination of a header file, extern "C", etc. would work.

Read the article
Android Touch Event Collision Detection

- by chrissb

I'm relatively new to both Java and Android, so hopefully the problem I'm having is stemming from something pretty minor that I've overlooked. I've got a (very early stage) game that I've started working on, for Android using Java. At this stage, when the user touches the screen, if they touched a point at which there is an enemy, the enemies health is decreased and they become immobile (for the current implementation at least). The issue that I'm having is that the touch detection doesn't always seem to work. I've got a testing sprite set up that goes to the eventX and eventY coordinates of the touch down event, and it always seems to collide with the enemy object. Yet, the enemy doesn't always register as being hit, and sometimes a hit is registered when the sprite indicates the touch coordinates were outside of the enemies bounding box. I realise that this probably doesn't mean much without any code, so here's what I've got so far. Be gentle, as this is literally my first attempt at something more than basic movement etc. First off, the MainGamePanel class registers the touch event, and informs the levelmanager class (which is what I set up to monitor/handle enemies) public boolean onTouchEvent(MotionEvent event) { if (event.getAction() == MotionEvent.ACTION_DOWN){ levelManager.handleActionDown((int)event.getX(), (int)event.getY()); targetX=event.getX(); targetY=event.getY(); } if (event.getAction() == MotionEvent.ACTION_MOVE) { //the gestures } if (event.getAction() == MotionEvent.ACTION_UP) { //touch was released } return true; } From there, in the levelmanager class the touch event is passed on to all of the enemies within a list array: public static void handleActionDown(int eventX,int eventY){ hit=false; for (enemy1 en : enemy1array){ en.handleActionDown(eventX, eventY); } } The rest of the collision code is handled within the enemies handleActionDown function: public void handleActionDown(int eventX, int eventY) { if(eventX>this.x-enemy1bitmap.getWidth() && eventX<this.x+enemy1bitmap.getWidth() && eventY>this.y-enemy1bitmap.getHeight() && eventY<this.x+enemy1bitmap.getHeight()){ takeDamage(1); levelmanager.setHit(); } } I should probably be using getWidth()/2 and getHeight()/2 for it to be more accurate, but I expanded the area to test this - although I've noticed no improvement. At this stage, the games detection over whether or not the enemy is hit is spotty at best. Generally it takes two or three attempts before a collision is successfully registered, even though the sprite that is being used for testing and set to the eventX and eventY coordinates always indicates that the collision should have worked. Hopefully someone can steer me in the right direction here, and if more information is needed, ask away! Cheers, -Chris

Read the article
User directive in nginx generates error despite running as UID root

- by Joost Schuur

I'm running nginx on a MacOS X machine, installed with brew, and when I launch nginx, even with sudo, I get the following warning in my log file over and over again: 4/21/11 2:03:42 AM org.nginx[3788] nginx: [warn] the "user" directive makes sense only if the master process runs with super-user privileges, ignored in /usr/local/etc/nginx/conf/nginx.conf:2 From nginx.conf: user jschuur staff; I'm already launching nginx with sudo, since I want the thing to listen on port 80. Shouldn't that be enough to give it the proper super user privileges? The nginx binary as it's installed: jschuur@Glenna:sbin ? master ls -la total 4544 drwxr-xr-x 3 jschuur staff 102 Apr 12 20:53 . drwxrwxr-x 15 jschuur staff 510 Apr 12 15:25 .. -rwxr-xr-x 1 jschuur staff 2325648 Apr 12 20:39 nginx FWIW, I recompiled the binary to set passenger up and moved it around from it's original location into /usr/local/sbin. Update: As it turns out MacOS X was restarting nginx after I'd stopped it, because the launchd plist in ~/Library/LaunchAgents had set it to 'KeepAlive'. However, because I installed this plist into my local user's LaunchAgents folder as opposed to /Library/LaunchAgents (or better yet /Library/LaunchDaemons, which run before you even log on), it wasn't executed as root. Because of an error about not having permissions to use port 80, it actually exited right away, but still wrote to the same log file as the nginx process I started with sudo. I had thought the errors stemming from the automatic restart were actually coming from my manual restart via sudo. So, bottom line, problem solved. The real problem here was the homebrew instructions specifically asking you to install the plist file into an area that wouldn't allow a local site to use port 80.

Read the article
How Exactly Is One Linux OS “Based On” Another Linux OS?

- by Jason Fitzpatrick

When reviewing different flavors of Linux, you’ll frequently come across phrases like “Ubuntu is based on Debian” but what exactly does that mean? Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites. The Question SuperUser reader PLPiper is trying to get a handle on how Linux variants work: I’ve been looking through quite a number of Linux distros recently to get an idea of what’s around, and one phrase that keeps coming up is that “[this OS] is based on [another OS]“. For example: Fedora is based on Red Hat Ubuntu is based on Debian Linux Mint is based on Ubuntu For someone coming from a Mac environment I understand how “OS X is based on Darwin”, however when I look at Linux Distros, I find myself asking “Aren’t they all based on Linux..?” In this context, what exactly does it mean for one Linux OS to be based on another Linux OS? So, what exactly does it mean when we talk about one version of Linux being based off another version? The Answer SuperUser contributor kostix offers a solid overview of the whole system: Linux is a kernel — a (complex) piece of software which works with the hardware and exports a certain Application Programming Interface (API), and binary conventions on how to precisely use it (Application Binary Interface, ABI) available to the “user-space” applications. Debian, RedHat and others are operating systems — complete software environments which consist of the kernel and a set of user-space programs which make the computer useful as they perform sensible tasks (sending/receiving mail, allowing you to browse the Internet, driving a robot etc). Now each such OS, while providing mostly the same software (there are not so many free mail server programs or Internet browsers or desktop environments, for example) differ in approaches to do this and also in their stated goals and release cycles. Quite typically these OSes are called “distributions”. This is, IMO, a somewhat wrong term stemming from the fact you’re technically able to build all the required software by hand and install it on a target machine, so these OSes distribute the packaged software so you either don’t need to build it (Debian, RedHat) or they facilitate such building (Gentoo). They also usually provide an installer which helps to install the OS onto a target machine. Making and supporting an OS is a very complicated task requiring a complex and intricate infrastructure (upload queues, build servers, a bug tracker, and archive servers, mailing list software etc etc etc) and staff. This obviously raises a high barrier for creating a new, from-scratch OS. For instance, Debian provides ca. 37k packages for some five hardware architectures — go figure how much work is put into supporting this stuff. Still, if someone thinks they need to create a new OS for whatever reason, it may be a good idea to use an existing foundation to build on. And this is exactly where OSes based on other OSes come into existence. For instance, Ubuntu builds upon Debian by just importing most packages from it and repackaging only a small subset of them, plus packaging their own, providing their own artwork, default settings, documentation etc. Note that there are variations to this “based on” thing. For instance, Debian fosters the creation of “pure blends” of itself: distributions which use Debian rather directly, and just add a bunch of packages and other stuff only useful for rather small groups of users such as those working in education or medicine or music industry etc. Another twist is that not all these OSes are based on Linux. For instance, Debian also provide FreeBSD and Hurd kernels. They have quite tiny user groups but anyway. Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.

Read the article
-[UIImage drawInRect:] / CGContextDrawImage() not releasing memory?

- by sohocoke

I wanted to easily blend a UIImage on top of another background image, so wrote a category method for UIImage, adapted from http://stackoverflow.com/questions/1309757/blend-two-uiimages : - (UIImage *) blendedImageOn:(UIImage *) backgroundImage { NSAutoreleasePool* pool = [[NSAutoreleasePool alloc] init]; UIGraphicsBeginImageContext(backgroundImage.size); CGRect rect = CGRectMake(0, 0, backgroundImage.size.width, backgroundImage.size.height); [backgroundImage drawInRect:rect]; [self drawInRect:rect]; UIImage* blendedImage = UIGraphicsGetImageFromCurrentImageContext(); UIGraphicsEndImageContext(); [pool release]; return blendedImage; } Unfortunately my app that uses the above method to load around 20 images and blend them with background and gloss images (so probably around 40 calls), is being jettisoned on the device. An Instruments session revealed that calls to malloc stemming from the calls to drawInRect: are responsible for the bulk of the memory usage. I tried replacing the drawInRect: messages with equivalent function calls to the function CGContextDrawImage but it didn't help. The AutoReleasePool was added after I found the memory usage problem; it also didn't make a difference. I'm thinking this is probably because I'm not using graphics contexts appropriately. Would calling the above method in a loop be a bad idea because of the number of contexts I create? Or did I simply miss something?

Read the article
NSIS takes ownership of IIS system files

- by Lucas

I recently encountered an issue with NSIS that I believe is related to an interaction with UAC, but I am at a loss to explain it and I do not know how to prevent it in the future. I have an installer that creates and removes IIS virtual directories using the NsisIIS plugin. The installer appeared worked correctly on my Windows 7 workstation. When the installer was run on a Windows 2008 R2 server it installed properly, but the uninstaller removed all of the virtual directories and put IIS is an unusable state; to the point that I had to remove the Default Web Site and re-add it. What I eventually found was that all of the IIS configuration files under C:\Windows\System32\inetsrv\config had a lock icon on them. Some investigation seem to indicate that this means a user account has taken ownership of the file, however all the files listed SYSTEM as the file owner. I did check a different server that I have not run the installer on, and it does not have the lock icon applied to the IIS files. I have also seen the same lock icon appear on other files that the NSIS installer creates. For instance, I have a Web.Config.tpl file that is processed using the NSIS ReplaceInFile which also appears with the lock icon after the installer finished. After I explicitly grant another user account access to the file, the lock icon goes away. I run the installer under the local Administrator account on the 2008 R2 server, so I do not get the UAC prompt. Here is the relevant code from the install.nsi file RequestExecutionLevel admin Section "Application" APP_SECTION SectionIn RO Call InstallApp SectionEnd Section "un.Uninstaller Section" Delete "$PROGRAMFILES\${PROGRAMFILESDIR}\Uninstall.exe" Call un.InstallApp SectionEnd Function InstallApp File /oname=Web.Config Web.Config.tpl !insertmacro ReplaceInFile Web.Config %CONNECTION_STRING% $CONNECTION_STRING FunctionEnd Function un.InstallApp ReadRegStr $0 HKLM "Software\${REGKEY}" "VirtualDir" NsisIIS::DeleteVDir "$0" Pop $0 FunctionEnd I have three questions stemming from this incident: How did this happen? How can I fix my installer to prevent it from happening again? How can I repair the permissions on the IIS config files.

Read the article
jQuery and MooTools Conflict

- by flipflopmedia

Okay, so I got jQuery to get along with MooTools with one script, by adding this at the top of the jQuery script: var $j = jQuery.noConflict(); and then replacing every: $( with $j( But how would you get MooTools to like the following script that using jQuery?? Thanks in advance for any input, Tracy //Fade In Content Viewer: By JavaScript Kit: http://www.javascriptkit.com var fadecontentviewer={ csszindex: 100, fade:function($allcontents, togglerid, selected, speed){ var selected=parseInt(selected) var $togglerdiv=$("#"+togglerid) var $target=$allcontents.eq(selected) if ($target.length==0){ //if no content exists at this index position (ie: stemming from redundant pagination link) alert("No content exists at page number "+selected+"!") return } if ($togglerdiv.attr('lastselected')==null || parseInt($togglerdiv.attr('lastselected'))!=selected){ var $toc=$("#"+togglerid+" .toc") var $selectedlink=$toc.eq(selected) $("#"+togglerid+" .next").attr('nextpage', (selected<$allcontents.length-1)? selected+1+'pg' : 0+'pg') $("#"+togglerid+" .prev").attr('previouspage', (selected==0)? $allcontents.length-1+'pg' : selected-1+'pg') $target.css({zIndex: this.csszindex++, visibility: 'visible'}) $target.hide() $target.fadeIn(speed) $toc.removeClass('selected') $selectedlink.addClass('selected') $togglerdiv.attr('lastselected', selected+'pg') } }, setuptoggler:function($allcontents, togglerid, speed){ var $toc=$("#"+togglerid+" .toc") $toc.each(function(index){ $(this).attr('pagenumber', index+'pg') }) var $next=$("#"+togglerid+" .next") var $prev=$("#"+togglerid+" .prev") $next.click(function(){ fadecontentviewer.fade($allcontents, togglerid, $(this).attr('nextpage'), speed) return false }) $prev.click(function(){ fadecontentviewer.fade($allcontents, togglerid, $(this).attr('previouspage'), speed) return false }) $toc.click(function(){ fadecontentviewer.fade($allcontents, togglerid, $(this).attr('pagenumber'), speed) return false }) }, init:function(fadeid, contentclass, togglerid, selected, speed){ $(document).ready(function(){ var faderheight=$("#"+fadeid).height() var $fadecontents=$("#"+fadeid+" ."+contentclass) $fadecontents.css({top: 0, left: 0, height: faderheight, visibility: 'hidden'}) fadecontentviewer.setuptoggler($fadecontents, togglerid, speed) setTimeout(function(){fadecontentviewer.fade($fadecontents, togglerid, selected, speed)}, 100) $(window).bind('unload', function(){ //clean up $("#"+togglerid+" .toc").unbind('click') $("#"+togglerid+" .next", "#"+togglerid+" .prev").unbind('click') }) }) } }

Read the article
Can you recommend a full-text search engine?

- by Jen

Can you recommend a full-text search engine? (Preferably open source) I have a database of many (though relatively short) HTML documents. I want users to be able to search this database by entering one or more search words in my C++ desktop application. Hence, I’m looking for a fast full-text search solution to integrate with my app. Ideally, it should: Skip common words, such as the, of, and, etc. Support stemming, i.e. search for run also finds documents containing runner, running and ran. Be able to update its index in the background as new documents are added to the database. Be able to provide search word suggestions (like Google Suggest) Have a well-documented API To illustrate, assume the database has just two documents: Document 1: This is a test of text search. Document 2: Testing is fun. The following words should be in the index: fun, search, test, testing, text. If the user types t in the search box, I want the application to be able to suggest test, testing and text (Ideally, the application should be able to query the search engine for the 10 most common search words starting with t). A search for testing should return both documents. Other points: I don't need multi-user support I don't need support for complex queries The database resides on the user's computer, so the indexing should be performed locally. Can you suggest a C or C++ based solution? (I’ve briefly reviewed CLucene and Xapian, but I’m not sure if either will address my needs, especially querying the search word indexes for the suggest feature).

Read the article
Detecting branch reintegration or merge in pre-commit script

- by Shawn Chin

Within a pre-commit script, is it possible (and if so, how) to identify commits stemming from an svn merge? svnlook changed ... shows files that have changed, but does not differentiate between merges and manual edits. Ideally, I would also like to differentiate between a standard merge and a merge --reintegrate. Background: I'm exploring the possibility of using pre-commit hooks to enforce SVN usage policies for our project. One of the policies state that some directories (such as /trunk) should not be modified directly, and changed only through the reintegration of feature branches. The pre-commit script would therefore reject all changes made to these directories apart from branch reintegrations. Any ideas? Update: I've explored the svnlook command, and the closest I've got is to detect and parse changes to the svn:mergeinfo property of the directory. This approach has some drawback: svnlook can flag up a change in properties, but not which property was changed. (a diff with the proplist of the previous revision is required) By inspecting changes in svn:mergeinfo, it is possible to detect that svn merge was run. However, there is no way to determine if the commits are purely a result of the merge. Changes manually made after the merge will go undetected. (related post: Diff transaction tree against another path/revision)

Read the article
Lucene wildcard queries

- by Javi

Hello, I have this question relating to Lucene. I have a form and I get a text from it and I want to perform a full text search in several fields. Suppose I get from the input the text "textToLook". I have a Lucene Analyzer with several filters. One of them is lowerCaseFilter, so when I create the index, words will be lowercased. Imagine I want to search into two fields field1 and field2 so the lucene query would be something like this (note that 'textToLook' now is 'texttolook'): field1: texttolook* field2:texttolook* In my class I have something like this to create the query. I works when there is no wildcard. String text = "textToLook"; String[] fields = {"field1", "field2"}; //analyser is the same as the one used for indexing Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("customAnalyzer"); MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer); org.apache.lucene.search.Query queryTextoLibre = parser.parse(text); With this code the query would be: field1: texttolook field2:texttolook but If I set text to "textToLook*" I get field1: textToLook* field2:textToLook* which won't find correctly as the indexes are in lowercase. I have read in lucene website this: " Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing" My problem cannot be solved by setting the behaviour case insensitive cause my analyzer has other fields which for examples remove some suffixes of words. I think I can solve the problem by getting how the text would be after going through the filters of my analyzer, then I could add the "*" and then I could build the Query with MultiFieldQueryParser. So in this example I woud get "textToLower" and after being passed to to these filters I could get "texttolower". After this I could make "textotolower*". But, is there any way to get the value of my text variable after going through all my analyzer's filters? How can I get all the filters of my analyzer? Is this possible? Thanks

Read the article
Full-text search in C++

- by Jen

I have a database of many (though relatively short) HTML documents. I want users to be able to search this database by entering one or more search words in a C++ desktop application. Hence, I’m looking for a fast full-text search solution. Ideally, it should: Skip common words, such as the, of, and, etc. Support stemming, i.e. search for run also finds documents containing runner, running and ran. Be able to update its index in the background as new documents are added to the database. Be able to provide search word suggestions (like Google Suggest) To illustrate, assume the database has just two documents: Document 1: This is a test of text search. Document 2: Testing is fun. The following words should be in the index: fun, search, test, testing, text. If the user types t in the search box, I want the application to be able to suggest test, testing and text (Ideally, the application should be able to query the search engine for the 10 most common search words starting with t). A search for testing should return both documents. Can you suggest a C or C++ based solution? (I’ve briefly reviewed CLucene and Xapian, but I’m not sure if either will address my needs, especially querying the search word indexes for the suggest feature).

Read the article
Performing a SVD on tweets. Memory problem

- by plotti

I have generated a huge csv file as an output from my pos tagging and stemming. It looks like this: word1, word2, word3, ..., word14400 person1 1 2 0 1 person2 0 0 1 0 ... person650 It contains the word counts for each person. Like this I am getting characteristic vectors for each person. I want to run a SVD on this beast, but it seems the matrix is too big to be held in memory to perform the operation. My quesion is: should i reduce the column size by removing words which have a column sum of for example 1, which means that they have been used only once. Do I bias the data too much with this attempt? I tried the rapidminer attempt, by loading the csv into the db. and then sequentially reading it in with batches for processing, like rapidminer proposes. But Mysql can't store that many columns in a table. If i transpose the data, and then retranspose it on import it also takes ages.... -- So in general I am asking for advice how to perform a svd on such a corpus.

Read the article
Create Session Variable from different datasources?

- by Szafranamn

Currently I am developing a dynamic website using Dreamweaver cs5 with ColdFusion 9 and using Access to create my databases along with QuickBooks and QODBC to create database. I have established a login session variable stemming from the login page. This session variable is being drawn from one Datasource "Access" Table "Logininfo" Field "FullName" but I wanted to create another session variable either at this point or further into the member's page to use in a query sequence. This session variable would stem from another Datasoucre "QBs" Table "Invoice" Field "CustomerRefFullName" which is generated through Quickbooks and QODBC. I am not sure if this is possible but if it is how do I do it. I want to do this so I can query the Invoice database to upload the customer's Invoices unique to them onto their page. So it would have to be related to their login credentials. If there is another better route to take I would greatly appreciate the advice. Below is the login code if there is additional information needed let me know. This is my current thinking/plan to do what I wish to intend hence the need to create the session variable: I have another Datasource "QBs" with a Table "Invoice" when I create another webpage for the customer to see their invoice I need to create a recordset that accesses that Table. In order to do so I think the best way would some home convert the session.FullName (which came from Access Datasource, Logininfor Table) into a session.CustomerRefFullName (which would have to come from (Datasource: QBs Table: Invoice Field: CustomerRefFullName) that way I could set the query WHERE CustomerRefFullName and have each logged in user see their specific Invoices. So is there a way to turn the session variable off one datasource/table into a different sessionvariable off a new datasource/table even if it is unique just to that page??? <cfif IsDefined("FORM.username")> SELECT FullName, Username,Password,AccessLevels FROM Logininfo WHERE Username= AND Password=

Read the article
Entity Framework Decorator Pattern

- by Anthony Compton

In my line of business we have Products. These products can be modified by a user by adding Modifications to them. Modifications can do things such as alter the price and alter properties of the Product. This, to me, seems to fit the Decorator pattern perfectly. Now, envision a database in which Products exist in one table and Modifications exist in another table and the database is hooked up to my app through the Entity Framework. How would I go about getting the Product objects and the Modification objects to implement the same interface so that I could use them interchangeably? For instance, the kind of things I would like to be able to do: Given a Modification object, call .GetNumThings(), which would then return the number of things in the original object, plus or minus the number of things added by the modification. This question may be stemming from a pretty serious lack of exposure to the nitty-gritty of EF (all of my experience so far has been pretty straight-forward LOB Silverlight apps), and if that's the case, please feel free to tell me to RTFM. Thanks in advance!

Read the article
Loading XML file containing leading zeros with SSIS preserving the zeros

- by Compudicted

Visiting the MSDN SQL Server Integration Services Forum oftentimes I could see that people would pop up asking this question: “why I am not able to load an element from an XML file that contains zeros so the leading/trailing zeros would remain intact?”. I started to suspect that such a trivial and often-required operation perhaps is being misunderstood by the developer community. I would also like to add that the whole state of affairs surrounding the XML today is probably also going to be increasingly affected by a motion of people who dislike XML in general and many aspects of it as XSD and XSLT invoke a negative reaction at best. Nevertheless, XML is in wide use today and its importance as a bridge between diverse systems is ever increasing. Therefore, I deiced to write up an example of loading an arbitrary XML file that contains leading zeros in one of its elements using SSIS so the leading zeros would be preserved keeping in mind the goal on simplicity into a table in SQL Server database. To start off bring up your BIDS (running as admin) and add a new Data Flow Task (DFT). This DFT will serve as container to adding our XML processing elements (besides, the XML Source is not available anywhere else other than from within the DFT). Double-click your DFT and drag and drop the XMS Source component from the Tool Box’s Data Flow Sources. Now, let the fun begin! Being inspired by the upcoming Christmas I created a simple XML file with one set of data that contains an imaginary SSN number of Rudolph containing several leading zeros like 0000003. This file can be viewed here. To configure the XML Source of course it is quite intuitive to point it to our XML file, next what the XML source needs is either an embedded schema (XSD) or it can generate one for us. In lack of the one I opted to auto-generate it for me and I ended up with an XSD that looked like: <?xml version="1.0"?> <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="XMasEvent"> <xs:complexType> <xs:sequence> <xs:element minOccurs="0" name="CaseInfo"> <xs:complexType> <xs:sequence> <xs:element minOccurs="0" name="ID" type="xs:unsignedByte" /> <xs:element minOccurs="0" name="CreatedDate" type="xs:unsignedInt" /> <xs:element minOccurs="0" name="LastName" type="xs:string" /> <xs:element minOccurs="0" name="FirstName" type="xs:string" /> <xs:element minOccurs="0" name="SSN" type="xs:unsignedByte" /> <!-- Becomes string -- > <xs:element minOccurs="0" name="DOB" type="xs:unsignedInt" /> <xs:element minOccurs="0" name="Event" type="xs:string" /> <xs:element minOccurs="0" name="ClosedDate" /> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:schema> As an aside on the XML file: if your XML file does not contain the outer node (<XMasEvent>) then you may end up in a situation where you see just one field in the output. Now please note that the SSN element’s data type was chosen to be of unsignedByte (and this is for a reason). The reason is stemming from the fact all our figures in the element are digits, this is good, but this is not exactly what we need, because if we will attempt to load the data with this XSD then we are going to either get errors on the destination or most typically lose the leading zeros. So the next intuitive choice is to change the data type to string. Besides, if a SSIS package was already created based on this XSD and the data type change was done thereafter, one should re-set the metadata by right-clicking the XML Source and choosing “Advanced Editor” in which there is a refresh button at the bottom left which will do the trick. So far so good, we are ready to load our XML file, well actually yes, and no, in my experience typically some data conversion may be required. So depending on your data destination you may need to tweak the data types targeted. Let’s add a Data Conversion Task to our DFT. Your package should look like: To make the story short I only will cover the SSN field, so in my data source the target SQL Table has it as nchar(10) and we chose string in our XSD (yes, this is a big difference), under such circumstances the SSIS will complain. So will go and manipulate on the data type of SSN by making it Unicode String (DT_WSTR), World String per se. The conversion should look like: The peek at the Metadata: We are almost there, now all we need is to configure the destination. For simplicity I chose SQL Server Destination. The mapping is a breeze, F5 and I am able to insert my data into SQL Server now! Checking the zeros – they are all intact!

Read the article
Fulltext search for django : Mysql not so bad ? (vs sphinx, xapian)

- by Eric

I am studying fulltext search engines for django. It must be simple to install, fast indexing, fast index update, not blocking while indexing, fast search. After reading many web pages, I put in short list : Mysql MYISAM fulltext, djapian/python-xapian, and django-sphinx I did not choose lucene because it seems complex, nor haystack as it has less features than djapian/django-sphinx (like fields weighting). Then I made some benchmarks, to do so, I collected many free books on the net to generate a database table with 1 485 000 records (id,title,body), each record is about 600 bytes long. From the database, I also generated a list of 100 000 existing words and shuffled them to create a search list. For the tests, I made 2 runs on my laptop (4Go RAM, Dual core 2.0Ghz): the first one, just after a server reboot to clear all caches, the second is done juste after in order to test how good are cached results. Here are the "home made" benchmark results : 1485000 records with Title (150 bytes) and body (450 bytes) Mysql 5.0.75/Ubuntu 9.04 Fulltext : ========================================================================== Full indexing : 7m14.146s 1 thread, 1000 searchs with single word randomly taken from database : First run : 0:01:11.553524 next run : 0:00:00.168508 Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext : ========================================================================== Full indexing : 6m08.154s 1 thread, 1000 searchs with single word randomly taken from database : First run : 0:01:11.553524 next run : 0:00:00.168508 1 thread, 100000 searchs with single word randomly taken from database : First run : 9m09s next run : 5m38s 1 thread, 10000 random strings (random strings should not be found in database) : just after the 100000 search test : 0:00:15.007353 1 thread, boolean search : 1000 x (+word1 +word2) First run : 0:00:21.205404 next run : 0:00:00.145098 Djapian Fulltext : ========================================================================== Full indexing : 84m7.601s 1 thread, 1000 searchs with single word randomly taken from database with prefetch : First run : 0:02:28.085680 next run : 0:00:14.300236 python-xapian Fulltext : ========================================================================== 1 thread, 1000 searchs with single word randomly taken from database : First run : 0:01:26.402084 next run : 0:00:00.695092 django-sphinx Fulltext : ========================================================================== Full indexing : 1m25.957s 1 thread, 1000 searchs with single word randomly taken from database : First run : 0:01:30.073001 next run : 0:00:05.203294 1 thread, 100000 searchs with single word randomly taken from database : First run : 12m48s next run : 9m45s 1 thread, 10000 random strings (random strings should not be found in database) : just after the 100000 search test : 0:00:23.535319 1 thread, boolean search : 1000 x (word1 word2) First run : 0:00:20.856486 next run : 0:00:03.005416 As you can see, Mysql is not so bad at all for fulltext search. In addition, its query cache is very efficient. Mysql seems to me a good choice as there is nothing to install (I need just to write a small script to synchronize an Innodb production table to a MyISAM search table) and as I do not really need advanced search feature like stemming etc... Here is the question : What do you think about Mysql fulltext search engine vs sphinx and xapian ?

Read the article
European Interoperability Framework - a new beginning?

- by trond-arne.undheim

The most controversial document in the history of the European Commission's IT policy is out. EIF is here, wrapped in the Communication "Towards interoperability for European public services", and including the new feature European Interoperability Strategy (EIS), arguably a higher strategic take on the same topic. Leaving EIS aside for a moment, the EIF controversy has been around IPR, defining open standards and about the proper terminology around standardization deliverables. Today, as the document finally emerges, what is the verdict? First of all, to be fair to those among you who do not spend your lives in the intricate labyrinths of Commission IT policy documents on interoperability, let's define what we are talking about. According to the Communication: "An interoperability framework is an agreed approach to interoperability for organisations that want to collaborate to provide joint delivery of public services. Within its scope of applicability, it specifies common elements such as vocabulary, concepts, principles, policies, guidelines, recommendations, standards, specifications and practices." The Good - EIF reconfirms that "The Digital Agenda can only take off if interoperability based on standards and open platforms is ensured" and also confirms that "The positive effect of open specifications is also demonstrated by the Internet ecosystem." - EIF takes a productive and pragmatic stance on openness: "In the context of the EIF, openness is the willingness of persons, organisations or other members of a community of interest to share knowledge and stimulate debate within that community, the ultimate goal being to advance knowledge and the use of this knowledge to solve problems" (p.11). "If the openness principle is applied in full: - All stakeholders have the same possibility of contributing to the development of the specification and public review is part of the decision-making process; - The specification is available for everybody to study; - Intellectual property rights related to the specification are licensed on FRAND terms or on a royalty-free basis in a way that allows implementation in both proprietary and open source software" (p. 26). - EIF is a formal Commission document. The former EIF 1.0 was a semi-formal deliverable from the PEGSCO, a working group of Member State representatives. - EIF tackles interoperability head-on and takes a clear stance: "Recommendation 22. When establishing European public services, public administrations should prefer open specifications, taking due account of the coverage of functional needs, maturity and market support." - The Commission will continue to support the National Interoperability Framework Observatory (NIFO), reconfirming the importance of coordinating such approaches across borders. - The Commission will align its internal interoperability strategy with the EIS through the eCommission initiative. - One cannot stress the importance of using open standards enough, whether in the context of open source or non-open source software. The EIF seems to have picked up on this fact: What does the EIF says about the relation between open specifications and open source software? The EIF introduces, as one of the characteristics of an open specification, the requirement that IPRs related to the specification have to be licensed on FRAND terms or on a royalty-free basis in a way that allows implementation in both proprietary and open source software. In this way, companies working under various business models can compete on an equal footing when providing solutions to public administrations while administrations that implement the standard in their own software (software that they own) can share such software with others under an open source licence if they so decide. - EIF is now among the center pieces of the Digital Agenda (even though this demands extensive inter-agency coordination in the Commission): "The EIS and the EIF will be maintained under the ISA Programme and kept in line with the results of other relevant Digital Agenda actions on interoperability and standards such as the ones on the reform of rules on implementation of ICT standards in Europe to allow use of certain ICT fora and consortia standards, on issuing guidelines on essential intellectual property rights and licensing conditions in standard-setting, including for ex-ante disclosure, and on providing guidance on the link between ICT standardisation and public procurement to help public authorities to use standards to promote efficiency and reduce lock-in.(Communication, p.7)" All in all, quite a few good things have happened to the document in the two years it has been on the shelf or was being re-written, depending on your perspective, in any case, awaiting the storms to calm. The Bad - While a certain pragmatism is required, and governments cannot migrate to full openness overnight, EIF gives a bit too much room for governments not to apply the openness principle in full. Plenty of reasons are given, which should maybe have been put as challenges to be overcome: "However, public administrations may decide to use less open specifications, if open specifications do not exist or do not meet functional interoperability needs. In all cases, specifications should be mature and sufficiently supported by the market, except if used in the context of creating innovative solutions". - EIF does not use the internationally established terminology: open standards. Rather, the EIF introduces the notion of "formalised specification". How do "formalised specifications" relate to "standards"? According to the FAQ provided: The word "standard" has a specific meaning in Europe as defined by Directive 98/34/EC. Only technical specifications approved by a recognised standardisation body can be called a standard. Many ICT systems rely on the use of specifications developed by other organisations such as a forum or consortium. The EIF introduces the notion of "formalised specification", which is either a standard pursuant to Directive 98/34/EC or a specification established by ICT fora and consortia. The term "open specification" used in the EIF, on the one hand, avoids terminological confusion with the Directive and, on the other, states the main features that comply with the basic principle of openness laid down in the EIF for European Public Services. Well, this may be somewhat true, but in reality, Europe is 30 year behind in terminology. Unless the European Standardization Reform gets completed in the next few months, most Member States will likely conclude that they will go on referencing and using standards beyond those created by the three European endorsed monopolists of standardization, CEN, CENELEC and ETSI. Who can afford to begin following the strict Brussels rules for what they can call open standards when, in reality, standards stemming from global standardization organizations, so-called fora/consortia, dominate in the IT industry. What exactly is EIF saying? Does it encourage Member States to go on using non-ESO standards as long as they call it something else? I guess I am all for it, although it is a bit cumbersome, no? Why was there so much interest around the EIF? The FAQ attempts to explain: Some Member States have begun to adopt policies to achieve interoperability for their public services. These actions have had a significant impact on the ecosystem built around the provision of such services, e.g. providers of ICT goods and services, standardisation bodies, industry fora and consortia, etc... The Commission identified a clear need for action at European level to ensure that actions by individual Member States would not create new electronic barriers that would hinder the development of interoperable European public services. As a result, all stakeholders involved in the delivery of electronic public services in Europe have expressed their opinions on how to increase interoperability for public services provided by the different public administrations in Europe. Well, it does not take two years to read 50 consultation documents, and the EU Standardization Reform is not yet completed, so, more pragmatically, you finally had to release the document. Ok, let's leave some of that aside because the document is out and some people are happy (and others definitely not). The Verdict Considering the controversy, the delays, the lobbying, and the interests at stake both in the EU, in Member States and among vendors large and small, this document is pretty impressive. As with a good wine that has not yet come to full maturity, let's say that it seems to be coming in in the 85-88/100 range, but only a more fine-grained analysis, enjoyment in good company, and ultimately, implementation, will tell. The European Commission has today adopted a significant interoperability initiative to encourage public administrations across the EU to maximise the social and economic potential of information and communication technologies. Today, we should rally around this achievement. Tomorrow, let's sit down and figure out what it means for the future.

Read the article

Search Results

Search found 28 results on 2 pages for 'stemming'.

Page 1/2 | 1 2 | Next Page >

- by mike

- by HAXEN

- by AspOnMyNet

- by BuckWoody

- by ubeatlenine

- by kgrad

- by Wirsbro

- by Marcus Wood

- by eom

- by chrissb

- by Joost Schuur

- by Jason Fitzpatrick

- by sohocoke

- by Lucas

- by flipflopmedia

- by Jen

- by Shawn Chin

- by Javi

- by Jen

- by plotti

- by Szafranamn

- by Anthony Compton

- by Compudicted

- by Eric

- by trond-arne.undheim

1 2 | Next Page >