lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Elie Naulleau" <>
Subject RE: Automatically determin Language of document
Date Fri, 23 Nov 2001 14:13:37 GMT

You could try Doug Beeferman's variable-length character n-gram approach
to identify a language among 13 european ones.

I tried it and it works pretty well. It's based on a similarity mesure
between a corpus-model and the input text.

There are issues depending on which character set you used (iso-latin1,
or other asci flavor).

If you just have 4 or 5 languages to deal with, you can build your
own with the most frequent word lists for each language. I have some
trivial C++ code that does it and can send it to you it you need.
Identified language is choosen on a frequency criterion.

Of course commercial product are available for that (try Xerox & inXight
for instance $$).

The point is how many language you have to identify...

Some time ago, Bright Station (UK) had some open source C/C++ code
 for a variety of stemmers for european language (adapted from
the Porter stemmer approach).

I hope this helps,
Elie Naulleau

-----Message d'origine-----
De : Strittmatter Stephan (external)
Envoyé : vendredi 23 novembre 2001 14:56
À : 'Lucene Users List'
Objet : Automatically determin Language of document


has anyone done anything to autodetect Language of an
HTML-Document which will be indexed by Lucene?

I will use Lucene to index an multilingual Portal
and want to filter the hits by language.

Thanks for any ideas,


To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message