lucene-dev mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Looking for a code pattern to pass stop words as an attribute
Date Wed, 22 Aug 2012 06:51:14 GMT
Thanks for the replies, Steve and Uwe.

> if you dont want to create your own "marker filter", you can use
> KeywordMarkerFilter (http://goo.gl/OOgf4) instead

This is pretty much what I had come up with, although I used a custom
filter class (with a similar attribute). The thing I have trouble with,
however, is that stop words may be based not only on term images but
also on other attributes. In particular, the Japanese pipeline uses
_two_ term suppression classes:

    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    ...
    stream = new StopFilter(matchVersion, stream, stopwords);

Of course I can just copy/paste the source of these and build my own
keyword marker; that much is clear to me. But I'd rather build a filter
that delegates to these original classes and aggregates their output,
so that I don't have to rebuild things on every upgrade. This is
where I'm kind of stuck. Something like:

    if (!japanesePOS.accept() || !stopfilter.accept()) {
      // mark the current token as a stopword.
    }

I'm just not sure if I can create such a non-linear filter pipeline:
wouldn't this confuse the attribute management code? Note that the
above filters (japanesePOS, stopfilter) would _not_ be part of the
token stream; they would be attached to one of the filters. Don't know
if I'm clear.
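
To make the idea more concrete, here is a minimal plain-Java sketch of
the aggregation I have in mind. It is deliberately independent of
Lucene: Token, notInStopwords, notInStopTags and markStopwords are
hypothetical names, not Lucene classes or methods, and the sketch says
nothing about whether real filter instances can be driven outside the
token stream. It only shows the "mark instead of remove" logic, with
each suppression rule modelled as an accept predicate.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Plain-Java model of the idea, NOT Lucene code; every name here is hypothetical.
final class StopwordMarkerSketch {

    /** Stand-in for a token's attributes: term image, part-of-speech tag, stopword flag. */
    static final class Token {
        final String term;
        final String posTag;
        boolean stopword; // set by the marker instead of dropping the token

        Token(String term, String posTag) {
            this.term = term;
            this.posTag = posTag;
        }
    }

    /** Accept rule based on the term image: accepts terms that are not in the set. */
    static Predicate<Token> notInStopwords(Set<String> stopwords) {
        return t -> !stopwords.contains(t.term);
    }

    /** Accept rule based on another attribute: accepts tags that are not in the set. */
    static Predicate<Token> notInStopTags(Set<String> stoptags) {
        return t -> !stoptags.contains(t.posTag);
    }

    /** Flags every token rejected by at least one rule; no token is removed. */
    static List<Token> markStopwords(List<Token> tokens, List<Predicate<Token>> acceptRules) {
        for (Token t : tokens) {
            for (Predicate<Token> rule : acceptRules) {
                if (!rule.test(t)) {
                    t.stopword = true;
                    break; // one rejection is enough
                }
            }
        }
        return tokens;
    }
}
```

The rules are only consulted, never chained, so adding a third
suppression criterion would mean adding one more predicate to the list
rather than another stage in the stream.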

Dawid
