lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: CompressingTermVectors; per-field decompress?
Date Thu, 02 Apr 2015 19:21:55 GMT
Vectors are totally per-document. It's hard to do anything smarter with
them. Basically, by this I mean that, IMO, vectors aren't going to get
better until the semantics around them improve. From the original
file formats, I get the impression they were modelled after stored
fields a lot, and I think that's why they will be as slow as stored
fields until things are fixed. For example:

* removing the embedded per-document schema of vectors. I can't
imagine a use case for this. I think in general you either have
vectors for docs in a given field X or you do not.
* removing the ability to store broken offsets (going backward, etc.)
into vectors.
* removing the ability to store offsets without positions. Why?
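
To make the "broken offsets" point concrete, here is a minimal sketch of what a stricter contract could check. The class and method names are hypothetical, not Lucene API, and the exact rules (start >= 0, end >= start, starts never going backward) are an assumption about what "not broken" would mean. With such a contract the delta between consecutive start offsets can never be negative.

```java
/** Hypothetical helper, not part of Lucene: validates one term's offsets within a document. */
final class OffsetCheck {
    private OffsetCheck() {}

    /**
     * Returns true when every start is >= 0, every end is >= its start,
     * and start offsets never go backward from one occurrence to the next.
     */
    static boolean wellFormed(int[] starts, int[] ends) {
        if (starts.length != ends.length) {
            return false;
        }
        int lastStart = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] < 0 || ends[i] < starts[i] || starts[i] < lastStart) {
                return false;
            }
            lastStart = starts[i]; // next delta (start - lastStart) stays non-negative
        }
        return true;
    }
}
```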

As for the current impl, it's fallen behind the stored fields, which
got a lot of improvements for 5.0. We at least gave it a little love:
it has a super-fast bulk merge when no deletions are present
(dirtyChunks, totalChunks, etc.). But in all other cases it's very
expensive.

Compression block sizes, etc. should be tuned. It should implement
getMergeInstance() and keep state to avoid shittons of decompressions
on merge. Maybe a high compression option should be looked at, though
getMergeInstance() should be a prerequisite for that or it will be too
slow. When the format matches between readers (typically the case,
except when upgrading from older versions, etc.), it should avoid
deserialization overhead if that is costly (still the case for stored
fields).
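
A rough sketch of the "keep state on merge" idea, as a toy model rather than the real codec classes. Only the name getMergeInstance() comes from the text above; ChunkedVectorsReader, MergeReader, the fake payload, and the decompression counter are all made up for illustration. The plain reader decompresses the containing chunk on every lookup, while the merge instance remembers the last chunk it decompressed, so a sequential pass over the documents pays for each chunk once.

```java
/** Toy chunk-compressed per-document store. Hypothetical, not Lucene's real reader. */
class ChunkedVectorsReader {
    protected final int numDocs;
    protected final int docsPerChunk;
    int decompressions = 0; // number of simulated chunk decompressions

    ChunkedVectorsReader(int numDocs, int docsPerChunk) {
        this.numDocs = numDocs;
        this.docsPerChunk = docsPerChunk;
    }

    /** Stand-in for an expensive decompression of one whole chunk. */
    protected int[] decompressChunk(int chunkId) {
        decompressions++;
        int[] docs = new int[docsPerChunk];
        for (int i = 0; i < docsPerChunk; i++) {
            docs[i] = chunkId * docsPerChunk + i; // fake payload: the doc id itself
        }
        return docs;
    }

    /** Random-access path: decompresses the containing chunk on every call. */
    int get(int docId) {
        if (docId < 0 || docId >= numDocs) {
            throw new IndexOutOfBoundsException("docId=" + docId);
        }
        return decompressChunk(docId / docsPerChunk)[docId % docsPerChunk];
    }

    /** Returns a reader tuned for sequential access that keeps the current chunk. */
    ChunkedVectorsReader getMergeInstance() {
        return new MergeReader(numDocs, docsPerChunk);
    }
}

/** Merge-time variant: caches the most recently decompressed chunk. */
class MergeReader extends ChunkedVectorsReader {
    private int cachedChunkId = -1;
    private int[] cachedChunk;

    MergeReader(int numDocs, int docsPerChunk) {
        super(numDocs, docsPerChunk);
    }

    @Override
    int get(int docId) {
        if (docId < 0 || docId >= numDocs) {
            throw new IndexOutOfBoundsException("docId=" + docId);
        }
        int chunkId = docId / docsPerChunk;
        if (chunkId != cachedChunkId) { // decompress only when crossing a chunk boundary
            cachedChunk = decompressChunk(chunkId);
            cachedChunkId = chunkId;
        }
        return cachedChunk[docId % docsPerChunk];
    }
}
```

In this model, reading 64 documents in order with 16 documents per chunk costs 64 decompressions through the plain reader and 4 through the merge instance. The state lives in a separate instance so that the random-access reader stays stateless.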

Fixing some of the big problems (lots of metadata/complexity needed
for embedded schema info, negative numbers when there should not be)
with vectors would also enable better compression, maybe even
underneath LZ4, like stored fields got in 5.0 too.


On Thu, Apr 2, 2015 at 2:51 PM, david.w.smiley@gmail.com
<david.w.smiley@gmail.com> wrote:
> I was looking at a JIRA issue someone posted pertaining to optimizing
> highlighting for when there are term vectors (SOLR-5855). I dug into the
> details a bit and learned something unexpected:
> CompressingTermVectorsReader.get(docId) fully loads all term vectors for the
> document.  The client/user consuming code in question might just want the
> term vectors for a subset of all fields that have term vectors.  Was this
> overlooked or are there benefits to the current approach?  I can’t think of
> any except that perhaps there’s better compression over all the data versus
> in smaller per-field chunks; although I’d trade that any day over being able
> to just get a subset of fields.  I could imagine it being useful to ask for
> some fields or all — in much the same way we handle stored field data.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley


