How to determine the (natural) language of a document?
- by Robert Petermeier
I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in.
Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is
Java
Not free for "semi-commercial" usage
This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it only one guess was correct. The other guesses were Swedish, English, Danish and French...
A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.