lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Looking for a code pattern to pass stop words as an attribute
Date Wed, 22 Aug 2012 08:03:45 GMT
You could misuse the attributes API:

All filters in a chain share the same attributes. The chaining itself achieves this: new TokenFilter(otherTS)
shares the attributes of otherTS. To chain non-linearly, you could do the following:

Create "helper" filters that are not part of the chain by linking them to the input TokenStream,
but never call incrementToken() on them. Because they share the input's attributes, they always
see the same attributes and attribute contents, so you could call accept() on them (if it were
not protected). The stream is driven by our own TokenFilter: we call incrementToken() only on
our input, and then misuse the helpers' accept() methods, which operate on the attributes that
our own incrementToken() call has already populated:

 

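stopwordMarkFilter = new TokenFilter(....) {

    // The attribute type is elided here; StopwordMarkerAttribute is only a
    // placeholder name for your own marker attribute interface.
    private final StopwordMarkerAttribute markerAtt = addAttribute(...);

    // Helpers: linked to the same input (so they share its attributes),
    // but never advanced and never part of the chain.
    private final FilteringTokenFilter japanesePOS =
        new JapanesePartOfSpeechStopFilter(true, input, stoptags);
    private final FilteringTokenFilter stopfilter =
        new StopFilter(matchVersion, input, stopwords);

    @Override
    public boolean incrementToken() throws IOException {
        // Only our own input is advanced.
        if (!input.incrementToken()) return false;

        // Does not compile as written: accept() is protected.
        if (!japanesePOS.accept() || !stopfilter.accept()) {
            // mark the current token as a stopword.
            markerAtt.setIsStopword(true);
        }
        return true;
    }
}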

 

The only problem: since accept() is not intended to be called from the outside, it is of course
protected...
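
To make the shared-attribute idea concrete, here is a minimal, self-contained sketch in plain Java.
It does not use Lucene at all: Attributes, Stream, Source, FilteringStream, TagStop, TermStop and
MarkingFilter are hypothetical stand-ins for the real classes, and the sample tokens and tags are
made up. Note that accept() is package-visible in this toy model, so it sidesteps the "protected"
problem rather than solving it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Toy stand-in for the shared attribute state (NOT the Lucene API).
final class Attributes {
    String term;          // text of the current token
    String partOfSpeech;  // tag of the current token
    boolean isStopword;   // marker set by MarkingFilter
}

// Every stream built on top of another one reuses the same Attributes instance.
abstract class Stream {
    final Attributes atts;
    Stream(Attributes atts) { this.atts = atts; }
    abstract boolean incrementToken();
}

// Source: writes each {term, tag} pair into the shared attributes.
final class Source extends Stream {
    private final Iterator<String[]> tokens;
    Source(List<String[]> tokens) {
        super(new Attributes());
        this.tokens = tokens.iterator();
    }
    boolean incrementToken() {
        if (!tokens.hasNext()) return false;
        String[] t = tokens.next();
        atts.term = t[0];
        atts.partOfSpeech = t[1];
        atts.isStopword = false;  // reset per-token state
        return true;
    }
}

// Used normally, this drops every token for which accept() returns false.
abstract class FilteringStream extends Stream {
    private final Stream input;
    FilteringStream(Stream input) { super(input.atts); this.input = input; }
    abstract boolean accept();  // package-visible here; an assumption of this toy model
    boolean incrementToken() {
        while (input.incrementToken()) {
            if (accept()) return true;
        }
        return false;
    }
}

final class TagStop extends FilteringStream {
    private final Set<String> stoptags;
    TagStop(Stream input, Set<String> stoptags) { super(input); this.stoptags = stoptags; }
    boolean accept() { return !stoptags.contains(atts.partOfSpeech); }
}

final class TermStop extends FilteringStream {
    private final Set<String> stopwords;
    TermStop(Stream input, Set<String> stopwords) { super(input); this.stopwords = stopwords; }
    boolean accept() { return !stopwords.contains(atts.term); }
}

// Marks instead of dropping: the helpers are linked to the input but never advanced.
final class MarkingFilter extends Stream {
    private final Stream input;
    private final FilteringStream tagStop;
    private final FilteringStream termStop;
    MarkingFilter(Stream input, Set<String> stoptags, Set<String> stopwords) {
        super(input.atts);
        this.input = input;
        this.tagStop = new TagStop(input, stoptags);
        this.termStop = new TermStop(input, stopwords);
    }
    boolean incrementToken() {
        if (!input.incrementToken()) return false;  // only the input is advanced
        if (!tagStop.accept() || !termStop.accept()) {
            atts.isStopword = true;  // helpers read the token we just populated
        }
        return true;
    }
}

final class SharedAttributeDemo {
    // Returns each term, with "*" appended when it was marked as a stopword.
    static List<String> run(List<String[]> tokens, Set<String> stoptags, Set<String> stopwords) {
        Stream s = new MarkingFilter(new Source(tokens), stoptags, stopwords);
        List<String> out = new ArrayList<>();
        while (s.incrementToken()) {
            out.add(s.atts.term + (s.atts.isStopword ? "*" : ""));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> tokens = Arrays.asList(
            new String[] {"the", "DET"},
            new String[] {"cat", "NOUN"},
            new String[] {"ha", "PARTICLE"});
        Set<String> stoptags = new HashSet<>(Arrays.asList("PARTICLE"));
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));
        System.out.println(run(tokens, stoptags, stopwords));  // prints [the*, cat, ha*]
    }
}
```

No token is dropped: the helpers only answer "would you accept the current token?", and the
marking filter aggregates their answers into the marker.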

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

> -----Original Message-----
> From: dawid.weiss@gmail.com [mailto:dawid.weiss@gmail.com] On Behalf Of
> Dawid Weiss
> Sent: Wednesday, August 22, 2012 8:51 AM
> To: dev@lucene.apache.org
> Subject: Re: Looking for a code pattern to pass stop words as an attribute
>
> Thanks for the replies, Steve, Uwe.
>
> > if you don't want to create your own "marker filter", you can use
> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
>
> This is pretty much what I had come up with, although I used a custom filter
> class (with a similar attribute). The thing I have trouble with, however, is that
> stop words may be based not only on term images but also on other attributes. In
> particular, the Japanese pipeline uses _two_ term suppression classes:
>
>     stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
>     ...
>     stream = new StopFilter(matchVersion, stream, stopwords);
>
> Of course I can just copy/paste the source of these and build my own keyword
> marker, this is clear to me. But I'd rather build a filter that delegates to these
> original classes and aggregates their output so that I don't have to rebuild
> things on every upgrade, and this is where I'm kind of stuck. Something like:
>
> if (!japanesePOS.accept() || !stopfilter.accept()) {
>   // mark the current token as a stopword.
> }
>
> I'm just not sure if I can create such a non-linear filter pipeline
> -- isn't this going to confuse the attribute management code? Note that the
> above filters (japanesePOS, blah) would _not_ be part of the token stream; they
> would be attached to one of the filters. Don't know if I'm clear.
>
> Dawid
