lucene-dev mailing list archives

From "" <>
Subject Re: CompressingTermVectors; per-field decompress?
Date Thu, 02 Apr 2015 20:02:02 GMT
Thanks for your input, Rob…

On Thu, Apr 2, 2015 at 3:21 PM, Robert Muir <> wrote:

> Vectors are totally per-document. It's hard to do anything smarter with
> them. Basically, by this I mean, IMO vectors aren't going to get better
> until the semantics around them improve. From the original
> file formats, I get the impression they were modelled a lot after stored
> fields, and I think that's why they will be as slow as stored
> fields until things are fixed.

They are fundamentally per-document, yes, like stored fields.  But I
don’t see how that constraint prevents the term vector format from
returning a lightweight “Fields” instance which loads per-field data on
demand when asked for it.

I follow most of your ideas below for a better term vector format, to
varying degrees, but again I don’t see them as blocking factors for
storing each field’s term data together so it can be accessed lazily
(don’t fetch fields you don’t need). Maybe you didn’t mean to imply they
are?  Though I think you did, by saying “vectors aren't going to get
better until the semantics around them improves”.
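To make the lazy idea concrete, here is a minimal, hypothetical sketch (none of these class names are Lucene APIs): a per-document “Fields” view that keeps each field’s term data compressed and only decompresses a field the first time it is requested, so fetching one field never pays for the others.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not the actual CompressingTermVectors code: a
// per-document view whose per-field term data stays compressed until a
// caller asks for that specific field.
class LazyFields {
    private final Map<String, byte[]> compressedPerField; // field -> compressed term data
    private final Map<String, String> decoded = new HashMap<>(); // lazily filled cache

    LazyFields(Map<String, byte[]> compressedPerField) {
        this.compressedPerField = compressedPerField;
    }

    // Decompress only the requested field, on first access; cache the result.
    String terms(String field) throws Exception {
        String cached = decoded.get(field);
        if (cached != null) return cached;
        byte[] blob = compressedPerField.get(field);
        if (blob == null) return null; // field has no vectors
        Inflater inflater = new Inflater();
        inflater.setInput(blob);
        byte[] out = new byte[1 << 16];
        int len = inflater.inflate(out);
        inflater.end();
        String result = new String(out, 0, len, java.nio.charset.StandardCharsets.UTF_8);
        decoded.put(field, result);
        return result;
    }

    // Helper used when writing: compress one field's term data.
    static byte[] compress(String s) {
        byte[] input = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 64];
        int len = deflater.deflate(out);
        deflater.end();
        byte[] trimmed = new byte[len];
        System.arraycopy(out, 0, trimmed, 0, len);
        return trimmed;
    }
}
```

The trade-off the sketch makes visible: per-field compression chunks may compress slightly worse than one whole-document chunk, in exchange for never decompressing fields the caller didn’t ask for.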

p.s. my term-vector feature wish-list includes an FST-based term
dictionary, to help the Terms instance support more features like
automaton intersection and easy O(log N) lookup.
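As a rough illustration of what that buys (the real wish-list item would use Lucene’s org.apache.lucene.util.fst.FST, which this sketch does not attempt to reproduce): even a plain sorted term dictionary supports O(log N) exact lookup and the seek-to-ceiling primitive that automaton intersection repeatedly uses to skip ahead, versus the linear scan a flat per-document term list forces.

```java
import java.util.Arrays;

// Hypothetical sketch of a sorted term dictionary; a real implementation
// would use an FST, but the lookup complexity argument is the same.
class SortedTermDict {
    private final String[] sortedTerms; // assumed sorted, as a term dictionary is

    SortedTermDict(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    // O(log N) exact-match lookup; returns the term's ordinal, or -1.
    int lookup(String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    // Seek to the first term >= the given string -- the primitive an
    // automaton intersection uses to jump over non-matching ranges.
    int seekCeil(String target) {
        int idx = Arrays.binarySearch(sortedTerms, target);
        return idx >= 0 ? idx : -(idx + 1);
    }
}
```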

~ David

> * removing the embedded per-document schema of vectors. I can't
> imagine a use case for this. I think in general you either have
> vectors for docs in a given field X or you do not.
> * removing the ability to store broken offsets (going backward, etc.)
> into vectors.
> * removing the ability to store offsets without positions. Why?
> As far as the current impl, it's fallen behind the stored fields, which
> got a lot of improvements for 5.0. We at least gave it a little love;
> it has a super-fast bulk merge when no deletions are present
> (dirtyChunks, totalChunks, etc.), but in all other cases it's very
> expensive.
> Compression block sizes, etc. should be tuned. It should implement
> getMergeInstance() and keep state to avoid shittons of decompressions
> on merge. Maybe a high-compression option should be looked at, though
> getMergeInstance() should be a prerequisite for that or it will be too
> slow. When the format matches between readers (typically the case,
> except when upgrading from older versions, etc.), it should avoid
> deserialization overhead if that is costly (still the case for stored
> fields).
> Fixing some of the big problems (lots of metadata/complexity needed
> for embedded schema info, negative numbers when there should not be)
> with vectors would also enable better compression, maybe even
> underneath LZ4, like stored fields got in 5.0 too.
> On Thu, Apr 2, 2015 at 2:51 PM,
> <> wrote:
> > I was looking at a JIRA issue someone posted pertaining to optimizing
> > highlighting for when there are term vectors (SOLR-5855).  I dug into
> > the details a bit and learned something unexpected:
> > CompressingTermVectorsReader.get(docId) fully loads all term vectors
> > for the document.  The client/user consuming code in question might
> > just want the term vectors for a subset of all fields that have term
> > vectors.  Was this overlooked, or are there benefits to the current
> > approach?  I can't think of any, except that perhaps there's better
> > compression over all the data versus smaller per-field chunks;
> > although I'd trade that any day for being able to get just a subset
> > of fields.  I could imagine it being useful to ask for some fields or
> > all, in much the same way we handle stored field data.
> >
> > ~ David Smiley
> > Freelance Apache Lucene/Solr Search Consultant/Developer
> >
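For readers following the thread, the API shape being asked for can be modeled in a few lines. This is purely illustrative (none of these names are real Lucene classes): get(docId) materializes every field's vectors, while a subset-aware variant only touches the fields the caller names, analogous to selective stored-field loading.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the call pattern under discussion; the
// "decompressions" counter stands in for the per-field work a real
// reader would do.
class TermVectorsExample {
    // Simulated per-document store: docId -> (field -> terms).
    private final Map<Integer, Map<String, List<String>>> store = new HashMap<>();
    int decompressions = 0; // per-field "decompress" operations performed

    void put(int docId, String field, List<String> terms) {
        store.computeIfAbsent(docId, k -> new HashMap<>()).put(field, terms);
    }

    // Current behavior: get(docId) pays for every field's vectors.
    Map<String, List<String>> getAll(int docId) {
        Map<String, List<String>> doc = store.getOrDefault(docId, Map.of());
        decompressions += doc.size();
        return doc;
    }

    // Proposed behavior: only touch the fields the caller asked for.
    Map<String, List<String>> getSubset(int docId, Set<String> fields) {
        Map<String, List<String>> doc = store.getOrDefault(docId, Map.of());
        Map<String, List<String>> result = new HashMap<>();
        for (String f : fields) {
            List<String> terms = doc.get(f);
            if (terms != null) {
                decompressions++;
                result.put(f, terms);
            }
        }
        return result;
    }
}
```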
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
