lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Busch <>
Subject Re: TokenStream and Token APIs
Date Tue, 21 Oct 2008 05:39:49 GMT
Grant Ingersoll wrote:
> On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:
>> Grant Ingersoll wrote:
>>> On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:
>>>> Grant Ingersoll wrote:
>>>>> Bear with me, b/c I'm not sure I'm following, but looking at 
>>>>>, I see at least 
>>>>> 5 different implemented Attributes.
>>>>> So, let's say I add a 5 more attributes and now have a total of 10 
>>>>> attributes. Are you saying that I then would have, potentially, 10 
>>>>> different variables that all point to the token as in the code 
>>>>> snippet above where the casting takes place? Or would I just 
>>>>> create a single "Super" attribute that folds in all of my new 
>>>>> attributes, plus any other existing ones? Or, maybe, what I would 
>>>>> do is create the 5 new attributes and then 1 new attribute that 
>>>>> extends all 10, thus allowing me to use them individually, but 
>>>>> saving me from having to do a whole ton of casting in my Consumer.
>>>> Potentially one consumer doing 10 things, but not likely right? I 
>>>> mean, things will stay logical as they are now, and rather than a 
>>>> super consumer doing everything, we will still have a chain of 
>>>> consumers each doing its own piece. So more likely, maybe something 
>>>> comes along every so often (another 5, over *much* time, say) and 
>>>> each time we add a Consumer that uses one or two TokenStream types. 
>>>> And then its just an implementation detail on whether you make a 
>>>> composite TokenStream - if you have added 10 new attributes and see 
>>>> it fit to make one consumer use them all, sure, make a composite, 
>>>> super type, but in my mind, the way its done in the example code is 
>>>> clearer/cleaner for a handful of TokenStream types. And even if you 
>>>> do make the composite,super type, its likely to just be a sugar 
>>>> wrapper anyway - the implementation for say, payload and positions, 
>>>> should probably be maintained in their own classes anyway.
>>> Well, there are 5 different attributes already, all of which are 
>>> commonly used.  Seems weird to have to cast the same var 5 different 
>>> ways.  Definitely agree that one would likely deal with this by 
>>> wrapping, but then you end up either needing to extend your wrapper 
>>> or add new wrappers...
>> Well yes, there are 5 attributes, but n neither of the core 
>> tokenstreams and -filters that I changed in my patch did I have to 
>> use more than two or three of those. Currently the only attributes 
>> that are really used are PositionIncrementAttribute and 
>> PayloadAttribute. And the OffsetAttribute when TermVectors are turned 
>> on.
>> Even in the indexing chain currently we don't have a single consumer 
>> that needs all attributes. The FreqProxWriter needs positions and 
>> payloads, the TermVectorsWriter needs positions and offsets.
> I have an application that uses all the attributes of a Token, or at 
> least, almost all of them.  There are many uses for Lucene's analysis 
> code that have nothing to do with indexing, Consumers or even Lucene.
>> Also, you don't have to cast the same variable multiple times. In the 
>> current patch you would call e. g. 
>> token.getAttribute(PayloadAttribute.class) and keep a reference to it 
>> in the consumer or filter.
>> IMO even calling getAttribute() 5 times or so and storing the 
>> references wouldn't be so bad. And if you really don't like it you 
>> could make a wrapper as you said. You also mentioned the 
>> disadvantages of the wrapper, e. g. that you would have to extend it 
>> to add new attributes. But then, isn't that the same disadvantage the 
>> current Token API has?
> True.  I didn't say the idea was bad, in fact I mostly like it, I was 
> just saying I'd like to explore how it would work in practice and the 
> main thing that struck me was all the casting or all the references.  
> Since it's likely that you only deal with a Token one at a time, 
> you're right, it's probably not a big deal other than the code looks 
> funny, IMO.
>> You could even use the new API in exact the same way as the old one. 
>> Just create a subclass of Token that has all members you need and 
>> don't add any attributes.
>> So I think the new API adds more flexibility, and still offers to use 
>> it in the same way as the old one. I however think the recommended 
>> best practice should be to use the new attributes, for reusability of 
>> consumers that only need certain attributes.
> Perhaps it would be useful for Lucene to offer exactly one subclass of 
> Token that we guarantee will always have all known Attributes (i.e. 
> the ones Lucene provides)  available to it for casting purposes.

Yeah we could do that. In fact, I did exactly this when I started 
working on this patch. I created a class called PlainToken, which had 
all the termBuffer and attributes logic, and changed Token to extend it. 
Then the new getToken() method would return an instance of PlainToken. 
My main concern with this approach is that it will make the code in the 
indexer more complicated, because it always has to check if we have a 
Token or PlainToken; if it's a Token then it has to use the get*() 
method directly, for a PlainToken it has tocheck for the *Attributes. So 
that's a bit messy (it's in fact exactly like that in the current patch 
for backwards-compatibility, but we could clean this up in 3.0). So for 
code simplicity I'm slightly in favor of not creating the a class that 
implements a default set of functionality without Attributes.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message