lucene-dev mailing list archives

From eks dev <>
Subject Lexicon access questions
Date Thu, 01 Jun 2006 10:10:29 GMT

We have faced the following use case:

To improve performance and, more importantly, the quality of search results, we have to
attach more attributes to particular words (Terms). Generic statistics like TF and IDF are
useful for modelling our "similarity" only up to a point. Examples of what we need:

1. Whether a Term is a first or last name (we have a comprehensive list of such words). This lets
us build smarter (faster and better) queries when someone has multiple first names, and it
influences ranking.
2. The agreement weight and disagreement weight of some words are modelled differently.
3. Semantic classes of words influence ranking (whether something is a verb or a noun changes
the search strategy and ranking radically).

On top of that, we can afford to load all terms into memory, which allows fast string distance
calculations and some limited pattern matching using some strange tries.
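
A minimal sketch of what in-memory string distance matching over the terms could look like.
It assumes a plain linear scan with Levenshtein distance, which is swapped in here for
illustration and is not the trie-based approach mentioned above. `TermScan`, `editDistance`
and `within` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: linear scan over all terms held in memory.
final class TermScan {

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the terms whose distance to the query is at most maxDistance,
    // in the order they appear in the input list.
    static List<String> within(List<String> terms, String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (editDistance(term, query) <= maxDistance) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

A scan like this is linear in the number of terms; a trie would mainly help by pruning
branches that can no longer stay within the distance limit.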

Today we solve this by implementing redundant data structures that keep some kind of
map Term->ValuesObject, which duplicates Lucene's Lexicon storage. Instead of
"one access gets all", we access each term twice over two different access paths: once through
our dictionary, and a second time implicitly via Query or similar. This introduces performance
and memory penalties. (Please keep in mind that we need access to a copy of the analyzed
document in order to attach the "additional info" to Terms.)
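
For illustration, a minimal sketch of such a side dictionary, assuming terms are identified
by field name plus text. `TermAttributes`, `SideLexicon` and all field names are hypothetical
and only mirror the attribute examples listed earlier; nothing here is a Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-term attributes; the fields are illustrative only.
final class TermAttributes {
    final boolean firstName;
    final boolean lastName;
    final float agreementWeight;
    final float disagreementWeight;

    TermAttributes(boolean firstName, boolean lastName,
                   float agreementWeight, float disagreementWeight) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.agreementWeight = agreementWeight;
        this.disagreementWeight = disagreementWeight;
    }
}

// Side dictionary kept next to the index: (field, text) -> attributes.
// It duplicates the term strings the index already stores in its own lexicon.
final class SideLexicon {
    private final Map<String, TermAttributes> byTerm = new HashMap<>();

    // Assumes field names never contain the NUL character used as separator.
    private static String key(String field, String text) {
        return field + '\u0000' + text;
    }

    void put(String field, String text, TermAttributes attrs) {
        byTerm.put(key(field, text), attrs);
    }

    // Returns null when no attributes were registered for the term.
    TermAttributes get(String field, String text) {
        return byTerm.get(key(field, text));
    }

    int size() {
        return byTerm.size();
    }
}
```

Every lookup in this map comes on top of the lookup the index performs for the same term,
which is the double access described above.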

I guess we are not the only ones facing this, since precision beyond what TF/IDF gives
can only be achieved by introducing some "domain semantics" where available. For this, "attaching"
domain-specific info to a Term would be the perfect solution. Also, enabling flexible implementations
for Lexicon access could give us some flexibility (e.g. implementation in mg4j goes in that

Could somebody imagine a 2.x version of Lucene having an interface with a clear contract
that we could implement in order to plug in our own lexicon access?
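
To make the question concrete, here is a rough sketch of what such a contract might look like.
It is purely hypothetical: `Lexicon`, `InMemoryLexicon` and their methods are invented for
illustration, a single field is assumed, and none of this corresponds to an existing Lucene
interface.

```java
import java.util.Set;
import java.util.TreeMap;

// Hypothetical contract: one lookup path for both statistics and
// application-defined attributes of type A.
interface Lexicon<A> {
    int docFreq(String text);              // 0 if the term is unknown
    A attributes(String text);             // null if the term is unknown
    Set<String> termsWithPrefix(String prefix);
}

// Hypothetical in-memory implementation backed by a sorted map.
final class InMemoryLexicon<A> implements Lexicon<A> {

    private static final class Entry<T> {
        final int docFreq;
        final T attrs;

        Entry(int docFreq, T attrs) {
            this.docFreq = docFreq;
            this.attrs = attrs;
        }
    }

    private final TreeMap<String, Entry<A>> entries = new TreeMap<>();

    void add(String text, int docFreq, A attrs) {
        entries.put(text, new Entry<>(docFreq, attrs));
    }

    @Override
    public int docFreq(String text) {
        Entry<A> e = entries.get(text);
        return e == null ? 0 : e.docFreq;
    }

    @Override
    public A attributes(String text) {
        Entry<A> e = entries.get(text);
        return e == null ? null : e.attrs;
    }

    // Sorted range scan; assumes terms do not contain U+FFFF.
    @Override
    public Set<String> termsWithPrefix(String prefix) {
        return entries.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet();
    }
}
```

With something along these lines, a single lookup would return both the document frequency
and the domain attributes, so a separate side dictionary would not be needed.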

Or even better, some hints on how I can do it today :)
