lucene-java-user mailing list archives

From henrib
Subject Re: Designing a multilingual index
Date Fri, 02 Apr 2010 08:32:28 GMT

I agree that if you don't know the "source" language - or can't determine it -
there is a lot of uncertainty in trying to transmogrify the query from one
language to another! Tika and Nutch do have language determination tools
though (ngram profiles, if I'm not mistaken). And you can also interact with
the end-user before issuing the query to confirm the language if necessary
(a "did you mean" kind of feature).
Assuming you can determine the query language and you do have "dictionaries"
of important terms per field, I tend to think translating the query that way
increases precision.
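
To make the "ngram profiles" idea concrete, here is a minimal sketch of
guessing a query's language from character trigram counts. It is not the Tika
or Nutch implementation; the class NgramLanguageGuesser, its methods, and the
cosine scoring are all assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not a Tika, Nutch or Lucene class.
final class NgramLanguageGuesser {
    // language code -> trigram counts built from a sample text
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character trigrams, padding with spaces so word edges are captured.
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Cosine similarity between two trigram count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the best-matching language, or null when no profile shares a
    // trigram with the query (the cue to ask the end-user instead).
    String guess(String query) {
        Map<String, Integer> q = trigrams(query);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(q, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

Short queries give few trigrams, so a real setup would want much larger
training samples per language, and a null or low-scoring guess is where the
"confirm with the end-user" step would come in.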

The simple route is to ignore the language, use ngrams, forget stemmers et al.,
and just fire; recall will likely be good, precision less so.
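
A small sketch of why that trade-off appears, using plain character trigrams
and no Lucene classes. NgramOverlapSketch and its methods are hypothetical
names, and the overlap ratio is only a stand-in for real scoring.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; this is not how Lucene scores documents.
final class NgramOverlapSketch {
    // Split on anything that is not a letter or digit (in any script) and
    // emit the character n-grams of each word; short words are kept whole.
    static Set<String> grams(String text, int n) {
        Set<String> out = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.length() <= n) {
                out.add(word);
                continue;
            }
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    // Fraction of the query's n-grams that also occur in the document.
    static double overlap(String query, String document, int n) {
        Set<String> q = grams(query, n);
        if (q.isEmpty()) {
            return 0;
        }
        Set<String> d = grams(document, n);
        int hits = 0;
        for (String g : q) {
            if (d.contains(g)) {
                hits++;
            }
        }
        return (double) hits / q.size();
    }
}
```

With trigrams, the query "runner" shares "run" and "unn" with a document
containing "running", so the match is found without any stemmer (good for
recall). The query "running" also shares four of its five trigrams with
"stunning", which is the kind of false match that hurts precision.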
