lucene-java-user mailing list archives

From Paul Hill <>
Subject Stemming - limited index expansion
Date Tue, 12 Jun 2012 19:07:24 GMT
As others have previously proposed on this list, I am interested in inserting a second token
at some positions in my index.  I'll call this Limited Index Expansion.
I want to retain the original token, so that I can score an original word that matches in
a text better than just any synonym/stem etc.  Maybe I'll even do this with payloads (on the
2nd token?).
If I didn't keep the original word, all I would be doing is a limited index-time "reduction".
Saving the original word and sometimes a lemma/stem (or something else) means I anticipate
at most two tokens at a position in the index.
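
To make the idea concrete, here is a rough plain-Java sketch of the token layout I have in mind. It is not Lucene code: the Token class, the expand method, and the toy stemmer are all made-up names for illustration. The only Lucene fact it leans on is that a token with a position increment of 0 sits at the same position as the token before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class LimitedExpansionSketch {

    /** Hypothetical stand-in for a token: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 means "same position as the previous token"

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits every original word with increment 1, followed by its stem with
     * increment 0 when the stemmer returns something different. That gives
     * at most two tokens per position.
     */
    static List<Token> expand(List<String> words, UnaryOperator<String> stemmer) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Token(word, 1));           // always keep the original
            String stem = stemmer.apply(word);
            if (stem != null && !stem.equals(word)) {
                out.add(new Token(stem, 0));       // stacked on the original's position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stemmer (an assumption, not a real stemmer): drops a trailing "s".
        UnaryOperator<String> toyStemmer =
            w -> (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
        System.out.println(expand(Arrays.asList("cats", "chase", "mice"), toyStemmer));
    }
}
```

In a real filter the same decision would presumably be made per token inside the analysis chain, with the stem emitted as an extra token whose position increment is 0; the sketch only shows the intended output shape.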

I couldn't find a nearly-right high-level Filter that I could use to add logic to call a stemmer
and conditionally add another token.  Any suggestions?
One idea I had is that adding a second token is much like what a SynonymFilter does, but yikes
I was starting to grok PendingInputs, PendingOutputs,
but wasn't getting very far reading through SynonymMap and its BytesRefHash etc.  Obviously
it is written to be very good with memory and very fast, but it looks a bit tricky to extend
for other sources of "synonyms". It is too bad that the get-synonym part of the operation
is not encapsulated in something pluggable or overridable, so I could just return an appropriate
array of CharsRefs.  The SynonymFilter is final anyway.

Can anyone point me toward any existing high-level filter that I could use by sub-classing,
modifying, plugging, or just as a good example to which I might add my additional code to
add another token?
Building Filters is new to me, but right now nothing is jumping out at me as a basis for such
a Filter.  Any suggestions?  Did I miss something in core or contrib?
Is there some other combination of buffering, copying, sinking, etc. filters that I'm missing
and should use to build a filter chain that would aid this process?

