lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Omitting TermVector info and index size
Date Wed, 14 Feb 2007 18:29:23 GMT
OK, final note. I wish I knew what kind of drugs I was on when I first
thought that the sizes were so much smaller. Because they weren't. I got to
thinking that "gee, it's kind of weird that if you don't specify anything
for TermVector when creating a field, you get all this advanced stuff. If it
was me, I'd default it to the simple case and make the user explicitly
invoke the more complex one".

So I re-ran my tests and Lo! the default when creating a field is
TermVector.NO! Which means that my original surprise is totally off-base,
and I WAS indexing with TermVector.NO since I didn't specify anything. So
why I thought the index was so much smaller, I have no clue. I'll have to
plead temporary insanity.

That said, I'm glad I have a better understanding of what this is all about
anyway. But I also have to apologize for wasting everyone's time.

Erick@RedInTheFace.org

On 2/14/07, Erick Erickson <erickerickson@gmail.com> wrote:
>
> It's always embarrassing when the correct unit test takes, say, 3 minutes
> to write and I've engaged in all this angst that I could have dispelled all
> by myself (although it is nice to have confirmation from folks in the know).
>
>
> The answer is that omitting term vectors has no influence on the behavior
> of span queries with custom position increment gaps. My example performs
> exactly the same regardless.
>
> Thanks again all.
>
> Erick
>
> On 2/14/07, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> > Thanks for that addition, it may well be important to me (as well as
> > pointing up a weakness in my unit tests. Honest, I've been thinking about
> > explicitly testing this. Really. I'll get around to it real soon now.
> > Truly....). We store multiple entries in the same field, think of it as
> > storing a list of content. For instance....
> >
> > this is subject entry one
> > this is subject entry two
> > this is subject entry three
> >
> > all stored by separate calls to doc.add("subject", .....);
> >
> > One consequence of this is that doing a span query, say "one this"~2
> > should NOT hit. So I've been setting the positionincrementgap to 1,000
> > between successive calls to doc.add.
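The effect of the gap described above can be illustrated with a minimal plain-Java sketch. This is not Lucene code: the class `PositionGapSketch` and the helper `near` are hypothetical, the tokenizer is a simple whitespace split, and the proximity check is a simplified stand-in for Lucene's slop computation. It assumes each token advances the position by one and that the gap is added before every value after the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionGapSketch {

    // Hypothetical helper: returns true if `second` occurs after `first`
    // within `maxDistance` positions, given a gap added between field values.
    static boolean near(List<String> values, int gap,
                        String first, String second, int maxDistance) {
        List<String> terms = new ArrayList<>();
        List<Integer> positions = new ArrayList<>();
        int position = 0;
        for (String value : values) {
            // Assumption: the gap applies only between values, not before the first.
            if (position > 0) {
                position += gap;
            }
            for (String token : value.split("\\s+")) {
                terms.add(token);
                positions.add(position++);
            }
        }
        for (int i = 0; i < terms.size(); i++) {
            if (!terms.get(i).equals(first)) continue;
            for (int j = 0; j < terms.size(); j++) {
                if (!terms.get(j).equals(second)) continue;
                int distance = positions.get(j) - positions.get(i);
                if (distance > 0 && distance <= maxDistance) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> subject = Arrays.asList(
                "this is subject entry one",
                "this is subject entry two",
                "this is subject entry three");
        // No gap: "one" (end of entry one) sits next to "this" (start of entry two).
        System.out.println(near(subject, 0, "one", "this", 2));
        // Gap of 1,000: the two entries are far apart, so the pair no longer matches.
        System.out.println(near(subject, 1000, "one", "this", 2));
    }
}
```

In this model the first call reports a match across the entry boundary and the second does not, which is the behavior the large gap is meant to produce.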
> >
> > If I'm reading your e-mail correctly, this implies that I *do* need to
> > use WITH_POSITIONS. Correct? I know, I know. Just write the cursed unit test
> > and answer your own question alright already......
> >
> > And then Mark's e-mail leads me to think I *don't* need to use
> > WITH_POSITIONS. So I *will* write the unit test today. Honest. Oh, and Mark,
> > Erik said that, not Erick <G>....... He's undoubtedly much more handsome
> > than me, so wouldn't appreciate being confused....
> >
> > Thanks again all
> > Erick
> >
> > On 2/14/07, Grant Ingersoll <gsingers@apache.org> wrote:
> > >
> > > As Erik stated, you don't need term vectors to do spans, but I
> > > thought I would add a bit on the difference between positions and
> > > offsets.
> > >
> > > Positions are what is stored in Lucene internally (see
> > > Token.getPositionIncrement() and TermPositions) and are usually just
> > > consecutive integers (although they can be manipulated, as can the
> > > offsets), whereas offsets are the character offsets from the indexed
> > > text (see Token.startOffset() and Token.endOffset()).
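The distinction between positions and offsets can be made concrete with a small plain-Java sketch. It does not use Lucene: `PositionOffsetSketch`, `TokenInfo`, and `tokenize` are hypothetical names, and a whitespace tokenizer stands in for a real analyzer. Positions count tokens (0, 1, 2, ...) while offsets are character indexes into the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionOffsetSketch {

    // Hypothetical holder for one token: its text, its token position,
    // and its character offsets (start inclusive, end exclusive).
    static final class TokenInfo {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        TokenInfo(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Splits on whitespace, assigning consecutive positions and recording
    // where each token starts and ends in the source string.
    static List<TokenInfo> tokenize(String text) {
        List<TokenInfo> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            tokens.add(new TokenInfo(m.group(), position++, m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (TokenInfo t : tokenize("this is subject entry one")) {
            System.out.println(t.term + " position=" + t.position
                    + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

Proximity and span matching only need the position column; the offsets matter when you want to map a hit back to characters in the original text, for example for highlighting.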
> > >
> > > I haven't used the highlighter, but I think it does have options for
> > > working with term vectors so that you don't have to re-analyze
> > > everything, so there may be some performance benefit to storing them,
> > > at the cost of disk space, like you said.
> > >
> > > On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> > >
> > > > > I'm indexing books, with a significant amount of overhead in each
> > > > > document and a LOT of OCR data. I'm indexing over 20,000 books and
> > > > > the index size is 8G. So I decided to play around with not storing
> > > > > some of the termvector information and I'm shocked at how much
> > > > > smaller the index is. By storing all my fields with
> > > > > Field.TermVector.WITH_POSITIONS, my index is reduced by OVER 75%.
> > > > > It went from 485M to 100M for my sample of 1,000 documents. Which
> > > > > implies my full index will be somewhere around 2G (I'll build the
> > > > > full index tonight and see).
> > > >
> > > > > My reasoning was that I do need position information since I need
> > > > > to do Span queries, but character information (WITH_OFFSETS) isn't
> > > > > necessary here/now. So I thought I'd make a small test to see if
> > > > > this was worth pursuing. If omitting offsets had only saved me 10%,
> > > > > for instance, I wouldn't pursue it very much. But 75+% is a savings
> > > > > well worth pursuing.
> > > >
> > > > > All of my unit tests run, some of which include spans and
> > > > > highlighting. Whether they're sophisticated enough to catch some
> > > > > subtle issue I won't guarantee.
> > > >
> > > > > I do NOT need to reconstruct the text, nor do I need to highlight
> > > > > with what's in the index, I handle highlighting by putting my
> > > > > display data in a MemoryIndex and running a query against that. I
> > > > > play some fun games to correlate my display and MemoryIndex info,
> > > > > but that's another story. Many thanks for the MemoryIndex
> > > > > contribution!!!
> > > >
> > > > > With that as a background, I have two questions....
> > > > > 1> Am I going off a cliff here? I suppose this is really answered by
> > > > > 2> what is the difference between WITH_POSITIONS and WITH_OFFSETS
> > > > > and YES and NO? I assume that WITH_POSITIONS is necessary for Span
> > > > > queries, for instance, which is all I really care about. While this
> > > > > has been discussed, I searched and didn't find a satisfactory
> > > > > answer (or at least an answer that I understood<G>).
> > > >
> > > > > I looked at Grant's PowerPoint presentation and I guess I'm really
> > > > > looking for confirmation of my interpretation that WITH_POSITIONS
> > > > > lets me do span queries and WITH_OFFSETS is irrelevant in my
> > > > > situation, one where I don't highlight and don't need to
> > > > > reconstruct the document......
> > > >
> > > > Many thanks
> > > > Erick
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > Center for Natural Language Processing
> > > http://www.cnlp.org
> > >
> > > > Read the Lucene Java FAQ at
> > > > http://wiki.apache.org/jakarta-lucene/LuceneFAQ
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
