lucene-dev mailing list archives

From: Marvin Humphrey
Subject: Re: bytecount as String and prefix length
Date: Wed, 02 Nov 2005 04:52:28 GMT

On Nov 1, 2005, at 9:51 AM, Doug Cutting wrote:

> Another approach might be to, instead of converting UTF-8 to
> strings right away, change things to convert lazily, if at all.
> During index merging such conversion should never be needed.


There ought to be some gains possible there, then.  No predictions as  
to how much, though.

> You needn't do this systematically throughout Lucene, but only  
> where it makes a big difference.  For example, if you could avoid  
> strings in SegmentMerger.mergeTermInfos() it might make a huge  
> difference.  This might be as simple as changing SegmentMergeInfo  
> to use a TermBuffer instead of a Term.  Does that make sense?

Abundant sense.  I'm not as familiar with SegmentMerger as I am with  
other parts of the org.apache.lucene.index package, because I haven't  
ported it yet.  But conceptually I understand exactly why this should  
require fewer resources.
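
To make the idea concrete, here is a minimal sketch of lazy conversion in
Java, written against a current JDK. LazyTermText and its methods are
hypothetical names made up for illustration; this is not Lucene's
TermBuffer and not the code in SegmentMerger. It only shows the principle:
keep the raw UTF-8 bytes, compare them directly, and build a String only if
somebody actually asks for one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Lucene's TermBuffer.
final class LazyTermText implements Comparable<LazyTermText> {
    private final byte[] utf8;   // raw UTF-8 bytes, e.g. as read from an index file
    private String decoded;      // filled in only when text() is first called
    private int decodeCount;     // demonstration aid: how many times we decoded

    LazyTermText(byte[] utf8) {
        this.utf8 = utf8.clone();
    }

    // Compares unsigned byte values without creating a String.
    // Caveat: this yields Unicode code point order, which can differ from
    // String.compareTo (UTF-16 code unit order) for supplementary characters.
    @Override
    public int compareTo(LazyTermText other) {
        int n = Math.min(utf8.length, other.utf8.length);
        for (int i = 0; i < n; i++) {
            int a = utf8[i] & 0xFF;
            int b = other.utf8[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return utf8.length - other.utf8.length;
    }

    // Decodes on first use and caches the result.
    String text() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);
            decodeCount++;
        }
        return decoded;
    }

    int decodeCount() {
        return decodeCount;
    }
}
```

A merge loop that only needs to order terms and copy them to the output
would call compareTo() and never text(), so under these assumptions no
String objects would be allocated for the terms it touches.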

I'll take a swing at SegmentMerger and submit a comprehensive diff.

Thanks for the suggestions,

Marvin Humphrey
Rectangular Research
