lucene-dev mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Looking for a code pattern to pass stop words as an attribute
Date Wed, 22 Aug 2012 08:11:08 GMT
Yeah, this is exactly what I was thinking about and it even worked
(accept() being protected is not a huge problem because these classes
are not final, so I can open it up using a local subclass). I just
wasn't sure whether this was too hacky. Thanks Uwe.

Dawid
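
The "open it up using a local subclass" trick can be sketched without Lucene on the classpath. This is only a rough model, not Lucene code: SharedState, Stream, StopFilterLike and OpenStopFilter are hypothetical stand-ins for the shared attribute state, the input TokenStream, a filtering filter with a protected accept(), and the local subclass. The helper is linked to the input but never advanced; only the input is.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class StopwordMarkerSketch {

    /** Hypothetical stand-in for the attribute state shared along a chain. */
    static final class SharedState {
        String term;       // text of the current token
        boolean stopword;  // marker flag, set instead of dropping the token
    }

    /** Hypothetical stand-in for the input stream; NOT Lucene's TokenStream. */
    static final class Stream {
        final SharedState state;
        private final Iterator<String> terms;

        Stream(SharedState state, Iterator<String> terms) {
            this.state = state;
            this.terms = terms;
        }

        boolean incrementToken() {
            if (!terms.hasNext()) return false;
            state.term = terms.next();
            state.stopword = false;  // reset the marker for each new token
            return true;
        }
    }

    /** Stand-in for a non-final filtering filter whose accept() is protected. */
    static class StopFilterLike {
        private final SharedState state;
        private final Set<String> stopwords;

        StopFilterLike(Stream input, Set<String> stopwords) {
            this.state = input.state;  // wrapping the input shares its state
            this.stopwords = stopwords;
        }

        protected boolean accept() {
            return !stopwords.contains(state.term);
        }
    }

    /** Local subclass whose only job is to widen accept() to public. */
    static final class OpenStopFilter extends StopFilterLike {
        OpenStopFilter(Stream input, Set<String> stopwords) {
            super(input, stopwords);
        }

        @Override
        public boolean accept() {
            return super.accept();
        }
    }

    /** Advances only the input; the helper is consulted but never advanced. */
    static List<Boolean> markStopwords(List<String> terms, Set<String> stopwords) {
        SharedState state = new SharedState();
        Stream input = new Stream(state, terms.iterator());
        OpenStopFilter helper = new OpenStopFilter(input, stopwords);

        List<Boolean> flags = new ArrayList<>();
        while (input.incrementToken()) {
            if (!helper.accept()) {
                state.stopword = true;  // mark the token, keep it in the stream
            }
            flags.add(state.stopword);
        }
        return flags;
    }

    public static void main(String[] args) {
        // prints [true, false, false]
        System.out.println(markStopwords(List.of("the", "quick", "fox"), Set.of("the")));
    }
}
```

Widening a protected method to public in an override is legal Java, so the subclass adds no logic of its own and should keep working across upgrades as long as accept() keeps its signature.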

On Wed, Aug 22, 2012 at 10:03 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> You could misuse the attributes API:
>
> All filters in a chain have the same attributes. This is achieved by the
> chaining (new TokenFilter(other TS) shares the attributes). What you could
> do to be non-linear in chaining:
>
> Create the "helpers" that are not part of the chain, by linking them to the
> input TokenStream, but never call incrementToken() on them. Their internals
> will always see the same attributes and attribute contents, so you could
> call accept() if it were not protected. The stream is controlled by
> our TokenFilter, so we call incrementToken() only on ours; we just misuse
> the accept() method (because it operates on the attributes we already
> populated by our own call to incrementToken()):
>
> stopwordMarkFilter = new TokenFilter(....) {
>
>     private final markerAtt = addAttribute(...);
>
>     // Helpers linked to the same input; never call incrementToken() on them.
>     private final FilteringTokenFilter japanesePOS =
>         new JapanesePartOfSpeechStopFilter(true, input, stoptags);
>
>     private final FilteringTokenFilter stopfilter =
>         new StopFilter(matchVersion, input, stopwords);
>
>     public boolean incrementToken() throws IOException {
>         if (!input.incrementToken()) return false;
>         if (!japanesePOS.accept() || !stopfilter.accept()) {
>             // mark the current token as a stopword.
>             markerAtt.setIsStopword(true);
>         }
>         return true;
>     }
> }
>
> The only problem: as accept() is not intended to be called from the outside,
> it is of course protected...
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: dawid.weiss@gmail.com [mailto:dawid.weiss@gmail.com] On Behalf Of
>> Dawid Weiss
>> Sent: Wednesday, August 22, 2012 8:51 AM
>> To: dev@lucene.apache.org
>> Subject: Re: Looking for a code pattern to pass stop words as an attribute
>>
>> Thanks for the replies Steve, Uwe.
>>
>> > if you don't want to create your own "marker filter", you can use
>> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
>>
>> This is pretty much what I had come up with, although I used a custom
>> filter class (with a similar attribute). The thing I have trouble with is,
>> however, that stop words may be based not only on term images but also on
>> other attributes. In particular, the Japanese pipeline uses _two_ term
>> suppression classes:
>>
>>     stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
>>     ...
>>     stream = new StopFilter(matchVersion, stream, stopwords);
>>
>> Of course I can just copy/paste the source of these and build my own
>> keyword marker, this is clear to me. But I'd rather build a filter that
>> delegates to these original classes and aggregates their output so that I
>> don't have to rebuild things on every upgrade, and this is where I'm kind
>> of stuck. Something like:
>>
>> if (!japanesePOS.accept() || !stopfilter.accept()) {
>>   // mark the current token as a stopword.
>> }
>>
>> I'm just not sure if I can create such a non-linear filter pipeline
>> -- if this isn't going to confuse the attribute management code? Note that
>> the above filters (japanesePOS, blah) would _not_ be part of the token
>> stream, they would be attached to one of the filters. Don't know if I'm
>> clear.
>>
>> Dawid
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
