lucene-dev mailing list archives

From "Karsten Konrad" <>
Subject Re: N-gram layer and language guessing
Date Tue, 03 Feb 2004 10:39:40 GMT


does anybody here use an n-gram layer for fault-tolerant searching 
on *larger* texts? I ask because you can expect to see far more 
n-grams than words emerging from a text once you use at least
quad-grams - and the number of different tokens indexed seems to 
be the most important parameter for Lucene's search speed.
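The token blow-up is easy to demonstrate. Below is a minimal sketch (not Lucene code, and the example sentence is just the Swedish phrases from this thread glued together) comparing distinct words against distinct character quad-grams:

```python
# A minimal sketch (not Lucene code) of why quad-gram indexing inflates
# the token count: distinct character quad-grams in even a short text
# far outnumber its distinct words.
def char_ngrams(text, n):
    """Return the set of distinct character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

text = "jag heter kalle och vad heter du"
words = set(text.split())
quads = char_ngrams(text, 4)
print(len(words), len(quads))  # 6 distinct words vs. 25 distinct quad-grams
```

On realistic documents the gap widens further, since the quad-gram vocabulary keeps growing long after the word vocabulary has saturated.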

Anyway, XtraMind's n-gram language guesser gives the following 
best five results on the Swedish examples discussed previously:

"jag heter kalle"

Swedish 100.00 %
Norwegian 17.51 %
Danish 10.02 %
Afrikaans 9.53 %
Dutch 8.79 %

"vad heter du"

Swedish 100.00 %
Dutch 20.97 %
Norwegian 14.68 %
Danish 11.07 %
Afrikaans 9.29 %

The guesser uses only tri- and quad-grams and is based on
a sophisticated machine learning algorithm instead of a raw
TF/IDF-weighting. The upside of this is the "confidence" 
value for estimating how much you can trust the 
classification. The downside is the model size: 5MB for 15 
languages, which comes mostly from using quad-grams - our 
machine learners don't do feature selection very well.
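The shape of those listings - the best language pinned to 100 % and the rest reported relative to it - can be illustrated with a crude scorer. The sketch below is emphatically not XtraMind's learner (their algorithm and confidence estimate are proprietary); it just scores tri- and quad-gram cosine similarity against toy training strings and normalizes by the best score:

```python
# A crude illustration -- not XtraMind's learner -- of scoring languages
# by tri- and quad-gram similarity and reporting each score relative to
# the best match, as in the listings above. The short training strings
# are toy stand-ins for real corpora.
from collections import Counter
import math

def gram_counts(text, sizes=(3, 4)):
    """Count all character tri- and quad-grams in a string."""
    return Counter(text[i:i + n] for n in sizes
                   for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[g] for g, v in a.items() if g in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profiles = {  # toy stand-ins; real profiles come from large corpora
    "swedish": gram_counts("jag heter kalle och jag bor i stockholm vad heter du"),
    "dutch":   gram_counts("ik heet kalle en ik woon in amsterdam hoe heet jij"),
}

query = gram_counts("vad heter du")
scores = {lang: cosine(query, prof) for lang, prof in profiles.items()}
best = max(scores.values())
for lang, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{lang} {100 * s / best:.2f} %")
```

With real profiles the relative scores, unlike a raw similarity, at least hint at how far the runner-up languages trail the winner.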

With kind regards from Saarbrücken


Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109

-----Original Message-----
From: Andrzej Bialecki [] 
Sent: Tuesday, 3 February 2004 09:27
To: Lucene Developers List
Subject: Re: N-gram layer

karl wettin wrote:
> On Mon, 2 Feb 2004 20:10:57 +0100
> "Jean-Francois Halleux" <> wrote:
>>during the past days, I've developed such a language guesser myself 
>>as a basis for a Lucene analyzer. It is based on trigrams. It is 
>>already working but not yet in a "publishable" state. If you or others 
>>are interested I can offer the sources.
> I use variable gram sizes due to the toughness of detecting the language 
> of very small texts such as a query. For instance, applying 
> bi->quad-grams to the Swedish sentence "Jag heter Karl" (my name is 
> Karl) is presumed to be Afrikaans. Using uni->quad-grams does the 
> trick.
> Also, I add penalties for gram-sized words found in the text but not 
> in the classified language. This improved my results even more.
> And I've been considering applying Markov chains to the grams where it 
> is still hard to guess the language, such as Afrikaans vs. Dutch and 
> American vs. British English.
> Let me know if you want a copy of my code.
> Here is some test output:
> As you see, single-word penalty on uni->quad does the trick on even 
> the smallest of text strings.
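The variable-gram-size-plus-penalty idea quoted above can be sketched roughly as follows. This is a guess at the approach, not karl's actual code; the profile texts, penalty weight, and function names are all invented for illustration:

```python
# A hedged sketch of the idea in karl's mail: score with uni- through
# quad-grams and penalize grams absent from a language's profile, which
# helps on very short inputs. Profiles, names, and weights are invented.
def ngram_set(text, lo=1, hi=4):
    """All distinct character n-grams of sizes lo..hi."""
    return {text[i:i + n] for n in range(lo, hi + 1)
                          for i in range(len(text) - n + 1)}

def score(query, profile, penalty=1.0):
    """Reward grams shared with the profile, penalize unseen ones."""
    q = ngram_set(query)
    hits = len(q & profile)
    misses = len(q - profile)
    return hits - penalty * misses

profiles = {  # toy profiles; real ones are built from large corpora
    "swedish":   ngram_set("jag heter kalle och vad heter du hej"),
    "afrikaans": ngram_set("my naam is kalle en wat is jou naam"),
}
query = "jag heter karl"
best = max(profiles, key=lambda lang: score(query, profiles[lang]))
print(best)
```

The penalty term is what rescues very short inputs: with only a handful of grams to vote, a few grams that a candidate language never produces are strong negative evidence.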

Well, perhaps it's also a matter of the quality of the language 
profiles. In one of my projects I'm using language profiles constructed 
from 1- to 5-grams, with a total of 300 grams per language profile. I don't 
do any additional tricks with penalizing the high-frequency words.
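Profiles of this shape - the top 300 most frequent 1- to 5-grams per language - are often compared by rank order, in the style of Cavnar & Trenkle's "out-of-place" measure. The sketch below assumes that measure; the distance actually used here may differ (the scores printed below look normalized differently):

```python
# A sketch of profile construction along the lines described: keep the
# 300 most frequent 1- to 5-grams per language and compare rank orders
# with an "out-of-place" distance (lower is better). The exact distance
# used in the mail above is an assumption.
from collections import Counter

def profile(text, max_n=5, top=300):
    """Map each of the `top` most frequent n-grams to its frequency rank."""
    counts = Counter(text[i:i + n]
                     for n in range(1, max_n + 1)
                     for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(query_prof, lang_prof):
    """Sum of rank differences; grams missing from the language profile
    receive the maximum penalty."""
    max_d = len(lang_prof)
    return sum(abs(rank - lang_prof.get(g, max_d))
               for g, rank in query_prof.items())

# toy profile texts; real profiles need substantial corpora per language
sv = profile("jag heter kalle och vad heter du hej pa dig")
nl = profile("ik heet kalle en ik woon hier wat is jouw naam")
q = profile("vad heter du")
d_sv = out_of_place(q, sv)
d_nl = out_of_place(q, nl)
```

Capping the profile at a few hundred grams is also the feature selection that keeps model size small - the contrast with the 5MB quad-gram model mentioned earlier in the thread.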

If I run the above example, I get the following:

  "jag heter kalle"
<input> - SV:   0.7197875
<input> - DN:   0.745925
<input> - NO:   0.747225
<input> - FI:   0.755475
<input> - NL:   0.7597125
<input> - EN:   0.76746875
<input> - FR:   0.77628125
<input> - GE:   0.7785125
<input> - IT:   0.796675
<input> - PL:   0.7984875
<input> - PT:   0.7995875
<input> - ES:   0.800775
<input> - RU:   0.88500625

However, for the text "vad heter du" (what's your name) the detection 
fails... :-)

A question: what was your source for the representative high-frequency 
words in various languages? Was it your training corpus or some publication?

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (

To unsubscribe, e-mail:
For additional commands, e-mail:

