Thursday, June 15, 2006

Automatic Language Identification from Text

In one of my recent projects, there was a business requirement to automatically identify the language of each text document and sort the documents by language.

I did some research on the internet and found several open-source tools that can help identify a language. One popular tool is "Lingua", which is open source and written in Perl.

Language identification works by searching a text for patterns that are characteristic of a particular language. Those patterns can be prefixes, suffixes, common words, n-grams (short runs of consecutive characters or words) or even longer sequences of words. More information about n-grams can be found here.
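
To make the pattern-matching idea concrete, here is a minimal Python sketch of identification by character trigrams. It is not how Lingua or TextCat work internally. It simply scores a text by the fraction of its trigrams that also occur in a small sample of each language. The sample sentences, the names SAMPLES, PROFILES, char_ngrams and identify, and the choice of trigrams are all my own assumptions for illustration. A real tool would build its profiles from much larger corpora.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count the character n-grams of a text (lowercased, whitespace collapsed)."""
    # Pad with one space on each side so word beginnings and endings
    # (roughly, prefixes and suffixes) show up as n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Hypothetical toy training samples; far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and "
               "the cat sat on the mat with this and that",
    "spanish": "el zorro salta sobre el perro perezoso y el gato "
               "se sienta en la alfombra con esto y eso",
}

# A language profile here is just the set of trigrams seen in its sample.
PROFILES = {lang: set(char_ngrams(sample)) for lang, sample in SAMPLES.items()}

def identify(text, profiles=PROFILES):
    """Return the language whose profile covers the largest share of the
    text's trigrams, or None if the text yields no trigrams."""
    grams = char_ngrams(text)
    total = sum(grams.values())
    if total == 0:
        return None
    scores = {
        lang: sum(count for gram, count in grams.items() if gram in profile) / total
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    for doc in ["the dog and the cat", "el gato y el perro"]:
        print(doc, "->", identify(doc))
```

Segregating documents is then a matter of calling identify on each document's text and grouping the documents by the returned label. With samples this small the scores are only meaningful for text that closely resembles the samples.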

Other interesting links on the same subject:
http://staff.science.uva.nl/~jvgemert/mia_page/LangTools.html
http://odur.let.rug.nl/~vannoord/TextCat/Demo/
http://staff.science.uva.nl/~jvgemert/mia_page/demo.html#Lid