lucene-general mailing list archives

From André Warnier
Subject Re: multiple instances of fields or attributes
Date Tue, 12 Feb 2008 20:08:10 GMT

Doron Cohen wrote:
> On Thu, Feb 7, 2008 at 6:03 PM, André Warnier wrote:
>> ...
>> Does anyone have an example of how this works ?
>> (or an explanation in plain French-speaker-friendly tutorial-like English
>> ?)
> Do you mean "how to make it work for you" or "how does it work inside"?
> The first option is easier to explain (though I know no French :))
> When you create an IndexWriter you provide it with an Analyzer.
> That analyzer is used when a document is added to the index.
> The analyzer.getPositionIncrementGap() method specifies the position
> gap between separate additions of the same field. By default it
> returns 0 (which does not work well in your example). To modify this
> you can override this method in "your" analyzer to return a nonzero gap,
> for example 5. This is easy when subclassing any existing analyzer.
> Doron

Now I may be starting to get it (although we French-speaking guys are 
slow (but thorough)).  Do you mean the following (add question mark at 
end) :
- imagine that I would create a field "descriptors" for each of my documents
- prior to adding a "phrase" to the "descriptors" field, I pass it 
through an Analyzer, the Analyzer breaks it down into words, and notes 
for each word its position in the phrase...
- then the Analyzer feeds it into the index, where the individual words 
are stored, together with their relative position in the "phrase"...
- so that, for instance (ignoring any stripping of stopwords), the 
phrase "the white cat jumped over the sleeping dog"  is now stored in 
the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the 
7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions 
in the phrase/field.
- so that, if I later search for "white cat"~1 in "descriptors", it will 
find this document, because the "distance" between "white" and "cat" is 
1 (or 0, depending on how one counts).
- now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my 
Analyzer, then for the second addition to the same "descriptors" field, 
it will start the numbering at 19 (?).
- thus if for instance the second instance of "descriptors" is the 
phrase "the cow bit the cat", this will be indexed as "19:the 20:cow 
21:bit 22:the 23:cat".
- and when searching for "dog cow"~5, it would not find this document, 
because the gap between "8:dog" and "20:cow" is greater than 5 ?

Is it something like that, or have I not got it at all ?
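
To make the numbering I describe above concrete, here is a small stand-alone
Java sketch. It is only a toy model of my understanding, not Lucene code: the
class name PositionGapSketch and the method number() are made up for this
message, words are split on whitespace only, and positions are counted from 1
as in my example (the real index may well count from 0). The assumption being
modelled is that the gap is added on top of the normal step of one position
between consecutive words.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of position numbering across several values of one field (not Lucene code). */
final class PositionGapSketch {

    /**
     * Numbers the words of each value, counting from 1 as in the message.
     * Before every value after the first, the counter jumps forward by gap.
     */
    static List<String> number(int gap, String... values) {
        List<String> out = new ArrayList<>();
        int next = 1; // position the next word will receive
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap; // assumed effect of the position increment gap
            }
            for (String word : values[i].trim().split("\\s+")) {
                out.add(next + ":" + word);
                next++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String first = "the white cat jumped over the sleeping dog";
        String second = "the cow bit the cat";
        // Gap 0: the second value continues right after the first one.
        System.out.println(String.join(" ", number(0, first, second)));
        // Gap 10: the second value starts 10 positions further on.
        System.out.println(String.join(" ", number(10, first, second)));
    }
}
```

Under these assumptions a gap of 0 numbers "cow" as 10, only two positions
after "8:dog", while a gap of 10 pushes the second phrase to start at 19, so
"dog" and "cow" end up 12 positions apart.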

To generalise my question, what I would like to know is this : assuming 
I have two "descriptors" for the same document : "Electrical and 
Electronic Engineering" and "Engineering Studies".
Is there a way to index this document (among others), and to later do a 
search which will find the documents which have a "descriptors" 
containing both "Electronic" and "Studies" in the same instance of 
"descriptors", thus not finding this one ?
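
If my understanding is right, one possible approach would be a gap larger than
any proximity distance used at search time, so that words from different
instances can never be "near" each other. The sketch below is again a toy
model and not Lucene code: SameInstanceSketch, positions() and near() are
hypothetical names, the "distance" is simply the absolute difference between
two positions (Lucene's own slop calculation for phrases is more subtle), and
the gap of 100 and distance of 5 are arbitrary values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of a proximity test over a multi-valued field (not Lucene code). */
final class SameInstanceSketch {

    /** Positions (counted from 0) of term when the values are numbered with the given gap. */
    static List<Integer> positions(String term, int gap, String... values) {
        List<Integer> found = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        int next = 0;
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                next += gap; // assumed effect of the position increment gap
            }
            for (String word : values[i].toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (word.equals(wanted)) {
                    found.add(next);
                }
                next++;
            }
        }
        return found;
    }

    /** True if some occurrence of a lies at most maxDistance positions from some occurrence of b. */
    static boolean near(String a, String b, int maxDistance, int gap, String... values) {
        for (int pa : positions(a, gap, values)) {
            for (int pb : positions(b, gap, values)) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] descriptors = {"Electrical and Electronic Engineering", "Engineering Studies"};
        // Gap 0: the two instances run together, so the words look close.
        System.out.println(near("Electronic", "Studies", 5, 0, descriptors));
        // Gap 100: the words are now far apart, so the document is not found.
        System.out.println(near("Electronic", "Studies", 5, 100, descriptors));
        // Words inside the same instance are still found with the large gap.
        System.out.println(near("Engineering", "Studies", 5, 100, descriptors));
    }
}
```

In this model the document is wrongly found with a gap of 0, is no longer
found with a gap of 100, and a search for two words from the same instance
("Engineering" and "Studies") still succeeds.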

