lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Omitting TermVector info and index size
Date Wed, 14 Feb 2007 14:36:15 GMT
As Erik stated, you don't need term vectors to do spans, but I  
thought I would add a bit on the difference between positions and  
offsets.

Positions are what is stored in Lucene internally (see  
Token.getPositionIncrement() and TermPositions) and are usually just  
consecutive integers (although they can be manipulated, as can the  
offsets), whereas offsets are the character offsets from the indexed  
text (see Token.startOffset() and Token.endOffset()).

I haven't used the highlighter, but I think it does have options for  
working with term vectors so that you don't have to re-analyze  
everything, so there may be some performance benefit to storing them,  
at the cost of disk space, like you said.

On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:

> I'm indexing books, with a significant amount of overhead in each  
> document
> and a LOT of OCR data. I'm indexing over 20,000 books and the index  
> size is
> 8G. So I decided to play around with not storing some of the  
> termvector
> information and I'm shocked at how much smaller the index is. By  
> storing all
> my fields with Field.TermVector.WITH_POSITIONS, my index is reduced  
> by OVER
> 75%. It went from 485M to 100M for my sample of 1,000 documents. Which
> implies my full index will be somewhere around 2G (I'll build the  
> full index
> tonight and see).
>
> My reasoning was that I do need position information since I need  
> to do Span
> queries,  but character information (WITH_OFFSETS) isn't necessary  
> here/now.
> So I thought I'd make a small test to see if this was worth  
> pursuing. If
> omitting offsets had only saved me 10%, for instance, I wouldn't  
> pursue it
> very much. But 75+% is a savings well worth pursuing.
>
> All of my unit tests run, some of which include spans and  
> highlighting.
> Whether they're sophisticated enough to catch some subtle issue I  
> won't
> guarantee.
>
> I do NOT need to reconstruct the text, nor do I need to highlight with
> what's in the index, I handle highlighting by putting my display  
> data in a
> MemoryIndex and running a query against that. I play some fun games to
> correlate my display and MemoryIndex info, but that's another  
> story. Many
> thanks for the MemoryIndex contribution!!!
>
> With that as a background, I have two questions....
> 1> Am I going off a cliff here? I suppose this is really answered by
> 2> what is the difference between WITH_POSITIONS and WITH_OFFSETS  
> and YES
> and NO? I assume that WITH_POSITIONS is necessary for Span queries,  
> for
> instance, which is all I really care about. While this has been  
> discussed, I
> searched and didn't find a satisfactory answer (or at least an  
> answer that I
> understood<G>).
>
> I looked at Grants PowerPoint presentation and I guess I'm really  
> looking
> for confirmation of my interpretation that WITH_POSITIONS lets me  
> do span
> queries and WITH_OFFSETS is irrelevant in my situation, one where I  
> don't
> highlight and don't need to reconstruct the document......
>
> Many thanks
> Erick

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message