How to determine the (natural) language of a document?

Posted by Robert Petermeier on Stack Overflow
Published on 2009-09-05T14:50:31Z

I have a set of documents in two languages: English and German. There is no usable metadata about these documents; a program can look only at the content. Based on that, the program has to decide which of the two languages each document is written in.

Is there a "standard" algorithm for this problem that can be implemented in a few hours? Or, alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is:

  1. Java
  2. Not free for "semi-commercial" usage

This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. Of the six German web pages I pointed it at, it guessed only one correctly. The other guesses were Swedish, English, Danish and French...

A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. The app could scan each document for occurrences of stop words from both languages, and the language with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately, I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
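
To make the stop-word idea concrete, here is a minimal sketch in Python (chosen for brevity; it should port to C# almost line for line). The function names, the two tiny word lists, and the "unknown" result for ties are illustrative assumptions, not taken from Lucene.Net or any other library. A real app would plug in fuller stop-word lists.

```python
import re

# Tiny illustrative stop-word lists (assumption: a real app would use fuller
# lists). Words that occur in both languages, such as "in", are left out
# because they carry no signal.
ENGLISH_STOP_WORDS = {"the", "and", "of", "to", "is", "that", "it", "for",
                      "with", "this", "not"}
GERMAN_STOP_WORDS = {"der", "das", "und", "ist", "nicht", "ein", "eine",
                     "zu", "mit", "für", "auf", "die"}


def count_stop_words(text, stop_words):
    """Count how many tokens of `text` appear in `stop_words`."""
    # Split into runs of letters; the pattern also matches umlauts and "ß".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for token in tokens if token in stop_words)


def guess_language(text):
    """Return "en", "de", or "unknown" when the counts are tied."""
    english = count_stop_words(text, ENGLISH_STOP_WORDS)
    german = count_stop_words(text, GERMAN_STOP_WORDS)
    if english == german:
        return "unknown"
    return "en" if english > german else "de"
```

Calling `guess_language(document_text)` on each document's text gives the guess. Returning "unknown" on a tie is one possible design choice; it lets very short or stop-word-free documents be flagged for review instead of being assigned a language arbitrarily.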
