lucene-dev mailing list archives

From Andrzej Bialecki
Subject Re: N-gram layer
Date Tue, 03 Feb 2004 08:27:25 GMT
karl wettin wrote:
> On Mon, 2 Feb 2004 20:10:57 +0100
> "Jean-Francois Halleux" wrote:
>>during the past days, I've developed such a language guesser myself
>>as a basis for a Lucene analyzer. It is based on trigrams. It is
>>already working but not yet in a "publishable" state. If you or others
>>are interested I can offer the sources.
> I use a variable gram size due to the difficulty of detecting the language
> of very small texts such as a query. For instance, applying bi->quadgrams to
> the Swedish sentence "Jag heter Karl" (my name is Karl) guesses it to
> be Afrikaans. Using uni->quadgrams does the trick.
> Also, I add penalties for gram-sized words found in the text but not in
> the classified language. This improved my results even more.
> And I've been considering applying Markov chains to the grams where it
> is still hard to guess the language, such as Afrikaans vs. Dutch and
> American vs. British English.
> Let me know if you want a copy of my code.
> Here is some test output:
> As you see, the single-word penalty on uni->quad does the trick on even the
> smallest of text strings.

Well, perhaps it's also a matter of the quality of the language 
profiles. In one of my projects I'm using language profiles constructed 
from 1-5 grams, with a total of 300 grams per language profile. I don't 
do any additional tricks such as penalizing high-frequency words.
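For concreteness, a profile of that shape is usually compared with the classic "out-of-place" measure (Cavnar & Trenkle, 1994). The sketch below uses hypothetical names and is not the code from my project:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a 1-5-gram, 300-entry language profile and the classic
// out-of-place distance (Cavnar & Trenkle). Names are illustrative.
public class NGramProfile {

    // Build a profile: the `size` most frequent grams of length 1..maxLen,
    // most frequent first. Whitespace is folded to '_' as a word boundary.
    static List<String> profile(String text, int maxLen, int size) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= maxLen; n++)
            for (int i = 0; i + n <= padded.length(); i++)
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(size)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Out-of-place distance: for each gram in the input profile, add how
    // far its rank is from its rank in the language profile (or a maximum
    // penalty if absent). Lower distance = better match.
    static int distance(List<String> input, List<String> lang) {
        int d = 0;
        for (int i = 0; i < input.size(); i++) {
            int j = lang.indexOf(input.get(i));
            d += (j < 0) ? lang.size() : Math.abs(i - j);
        }
        return d;
    }
}
```

Normalizing the distance by its worst possible value gives scores in [0, 1] of the kind shown below, where the lowest score wins.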

If I run the above example, I get the following:

  "jag heter kalle"
<input> - SV:   0.7197875
<input> - DN:   0.745925
<input> - NO:   0.747225
<input> - FI:   0.755475
<input> - NL:   0.7597125
<input> - EN:   0.76746875
<input> - FR:   0.77628125
<input> - GE:   0.7785125
<input> - IT:   0.796675
<input> - PL:   0.7984875
<input> - PT:   0.7995875
<input> - ES:   0.800775
<input> - RU:   0.88500625

However, for the text "vad heter du" (what's your name) the detection 
fails... :-)

A question: what was your source for the representative high-frequency 
words in the various languages? Was it your training corpus, or a publication?

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer
