lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Looking for a code pattern to pass stop words as an attribute
Date Wed, 22 Aug 2012 08:22:01 GMT
All filters must be final, and as far as I can see they are, so subclassing is not an option:
public final class StopFilter extends FilteringTokenFilter
public final class JapanesePartOfSpeechStopFilter extends FilteringTokenFilter

In any case, you can move your special filter into the same package as FilteringTokenFilter; protected members such as accept() are also accessible from within the package.
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
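
For readers who want to try the idea without a Lucene checkout, here is a minimal, self-contained sketch of the pattern discussed in this thread. It does not use Lucene at all: Attributes, Stream, ListSource, FilteringFilter, PosStopFilter, TermStopFilter and StopwordMarker are hypothetical stand-ins invented for this illustration, and the "*" suffix is just a way to print the marker. It only models three points from the mails: every filter built on the same input shares one set of attributes, the helper filters are never advanced, and the protected accept() is reachable because the caller lives in the same package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class StopwordMarkerSketch {

    /** Shared mutable state, standing in for the attributes of one filter chain. */
    static final class Attributes {
        String term;
        String partOfSpeech;
        boolean stopword;
    }

    /** Toy token stream; every stream built on the same input holds the same Attributes. */
    abstract static class Stream {
        final Attributes atts;
        Stream(Attributes atts) { this.atts = atts; }
        abstract boolean incrementToken();
    }

    /** Source that replays {term, partOfSpeech} pairs into the shared attributes. */
    static final class ListSource extends Stream {
        private final Iterator<String[]> it;
        ListSource(List<String[]> tokens) {
            super(new Attributes());
            this.it = tokens.iterator();
        }
        @Override boolean incrementToken() {
            if (!it.hasNext()) return false;
            String[] t = it.next();
            atts.term = t[0];
            atts.partOfSpeech = t[1];
            atts.stopword = false; // reset the marker for every new token
            return true;
        }
    }

    /** Toy filtering filter: accept() is protected and only reads the shared state. */
    abstract static class FilteringFilter extends Stream {
        private final Stream input;
        FilteringFilter(Stream input) { super(input.atts); this.input = input; }
        protected abstract boolean accept();
        @Override final boolean incrementToken() {
            while (input.incrementToken()) {
                if (accept()) return true;
            }
            return false;
        }
    }

    static final class TermStopFilter extends FilteringFilter {
        private final Set<String> stopWords;
        TermStopFilter(Stream input, Set<String> stopWords) { super(input); this.stopWords = stopWords; }
        @Override protected boolean accept() { return !stopWords.contains(atts.term); }
    }

    static final class PosStopFilter extends FilteringFilter {
        private final Set<String> stopTags;
        PosStopFilter(Stream input, Set<String> stopTags) { super(input); this.stopTags = stopTags; }
        @Override protected boolean accept() { return !stopTags.contains(atts.partOfSpeech); }
    }

    /** Marks stop words instead of dropping them; the two helpers are never advanced. */
    static final class StopwordMarker extends Stream {
        private final Stream input;
        private final FilteringFilter posFilter;
        private final FilteringFilter termFilter;
        StopwordMarker(Stream input, Set<String> stopTags, Set<String> stopWords) {
            super(input.atts);
            this.input = input;
            this.posFilter = new PosStopFilter(input, stopTags);
            this.termFilter = new TermStopFilter(input, stopWords);
        }
        @Override boolean incrementToken() {
            if (!input.incrementToken()) return false;
            // Allowed here because this class is in the same package as FilteringFilter.
            if (!posFilter.accept() || !termFilter.accept()) atts.stopword = true;
            return true;
        }
    }

    /** Runs the marker over the tokens and renders marked terms with a trailing '*'. */
    static List<String> mark(List<String[]> tokens, Set<String> stopTags, Set<String> stopWords) {
        Stream marker = new StopwordMarker(new ListSource(tokens), stopTags, stopWords);
        List<String> out = new ArrayList<>();
        while (marker.incrementToken()) {
            out.add(marker.atts.term + (marker.atts.stopword ? "*" : ""));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> tokens = Arrays.asList(
            new String[] {"the", "DET"}, new String[] {"quick", "ADJ"},
            new String[] {"of", "PREP"}, new String[] {"fox", "NOUN"});
        Set<String> stopTags = new HashSet<>(Arrays.asList("PREP"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(mark(tokens, stopTags, stopWords)); // prints [the*, quick, of*, fox]
    }
}
```

In this toy model no token is dropped: "the" is marked by the term-based helper and "of" by the tag-based one, while the marker alone drives the input. Whether the same trick is acceptable against the real classes is exactly the "too hacky?" question raised in the thread.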


> -----Original Message-----
> From: dawid.weiss@gmail.com [mailto:dawid.weiss@gmail.com] On Behalf Of
> Dawid Weiss
> Sent: Wednesday, August 22, 2012 10:11 AM
> To: dev@lucene.apache.org
> Subject: Re: Looking for a code pattern to pass stop words as an attribute
> 
> Yeah, this is exactly what I was thinking about, and it even worked (accept()
> being protected is not a huge problem because these classes are not final, so
> I can open it up using a local subclass). I just wasn't sure whether this is
> too hacky. Thanks, Uwe.
> 
> Dawid
> 
> On Wed, Aug 22, 2012 at 10:03 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > You could misuse the attributes API:
> >
> > All filters in a chain share the same attributes. The chaining itself
> > achieves this (new TokenFilter(other TS) shares the attributes). Here is
> > what you could do to make the chaining non-linear:
> >
> > Create the "helpers" that are not part of the chain by linking them
> > to the input TokenStream, but never call incrementToken() on them.
> > Their internals will always see the same attributes and attribute
> > contents, so you could call accept() on them, if it were not protected.
> > The stream is controlled by our TokenFilter, so we call incrementToken()
> > only on ours and just misuse the accept() method (because it operates on
> > the attributes already populated by our own call to incrementToken()):
> >
> > stopwordMarkFilter = new TokenFilter(....) {
> >   // MarkerAttribute is a placeholder for your own marker attribute type.
> >   private final MarkerAttribute markerAtt = addAttribute(...);
> >
> >   // Helpers linked to the same input; incrementToken() is never called on them.
> >   private final FilteringTokenFilter japanesePOS =
> >       new JapanesePartOfSpeechStopFilter(true, input, stoptags);
> >   private final FilteringTokenFilter stopfilter =
> >       new StopFilter(matchVersion, input, stopwords);
> >
> >   @Override
> >   public boolean incrementToken() throws IOException {
> >     if (!input.incrementToken()) return false;
> >     if (!japanesePOS.accept() || !stopfilter.accept()) {
> >       // mark the current token as a stopword.
> >       markerAtt.setIsStopword(true);
> >     }
> >     return true;
> >   }
> > };
> >
> > The only problem: since accept() is not intended to be called from the
> > outside, it is of course protected...
> >
> >> -----Original Message-----
> >> From: dawid.weiss@gmail.com [mailto:dawid.weiss@gmail.com] On Behalf Of
> >> Dawid Weiss
> >> Sent: Wednesday, August 22, 2012 8:51 AM
> >> To: dev@lucene.apache.org
> >> Subject: Re: Looking for a code pattern to pass stop words as an attribute
> >>
> >> Thanks for the replies, Steve, Uwe.
> >>
> >> > if you don't want to create your own "marker filter", you can use
> >> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
> >>
> >> This is pretty much what I had come up with, although I used a custom
> >> filter class (with a similar attribute). The thing I have trouble with,
> >> however, is that stop words may be based not only on term images but also
> >> on other attributes. In particular, the Japanese pipeline uses _two_ term
> >> suppression classes:
> >>
> >>     stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
> >>     ...
> >>     stream = new StopFilter(matchVersion, stream, stopwords);
> >>
> >> Of course I can just copy/paste the source of these and build my own
> >> keyword marker; this is clear to me. But I'd rather build a filter that
> >> delegates to these original classes and aggregates their output, so that I
> >> don't have to rebuild things on every upgrade, and this is where I'm kind
> >> of stuck. Something like:
> >>
> >> if (!japanesePOS.accept() || !stopfilter.accept()) {
> >>   // mark the current token as a stopword.
> >> }
> >>
> >> I'm just not sure if I can create such a non-linear filter pipeline
> >> -- won't this confuse the attribute management code? Note that the
> >> above filters (japanesePOS, blah) would _not_ be part of the token stream;
> >> they would be attached to one of the filters. Don't know if I'm clear.
> >>
> >> Dawid
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> 

