lucene-dev mailing list archives

From Shai Erera <>
Subject Re: Anticipating a benchmark for direct posting format
Date Mon, 07 Apr 2014 14:58:24 GMT
The only problem is how the Codec makes a dynamic decision on whether to
use the wrapped Codec for reading vs pre-load data into in-memory
structures, because Codecs are loaded through reflection by the SPI loading

There is also a TODO in DirectPF to allow wrapping arbitrary PFs, just
mentioning in case you want to tackle DPF.

I think that if we allowed passing something like a CodecLookupService,
with an SPILookupService default impl, you could easily pass that to
DirectoryReader which will use your runtime logic to load the right PF
(e.g. DPF) instead of the one the index was created with.
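
A minimal sketch of the lookup-service idea, in plain Java with no Lucene
dependency. Every name here (CodecLookupService, PostingsFormatStub,
RegistryLookupService, DirectOverrideLookup) is hypothetical and only
illustrates the shape of the proposal; none of it is an existing Lucene API.
The registry stands in for the default name-based SPI lookup, and the override
decides at open time whether to wrap the on-disk format in an in-memory one.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecLookupSketch {
    /** Stand-in for a postings format; NOT Lucene's PostingsFormat class. */
    public interface PostingsFormatStub {
        String name();
    }

    /** Hypothetical hook: maps the format name recorded in the index to the format used for reading. */
    public interface CodecLookupService {
        PostingsFormatStub lookup(String nameInIndex);
    }

    /** Default behaviour: resolve purely by the recorded name, as a name-based SPI lookup would. */
    public static final class RegistryLookupService implements CodecLookupService {
        private final Map<String, PostingsFormatStub> registry = new HashMap<>();

        public void register(PostingsFormatStub pf) {
            registry.put(pf.name(), pf);
        }

        @Override
        public PostingsFormatStub lookup(String nameInIndex) {
            PostingsFormatStub pf = registry.get(nameInIndex);
            if (pf == null) {
                throw new IllegalArgumentException("unknown format: " + nameInIndex);
            }
            return pf;
        }
    }

    /** Runtime override: optionally wrap whatever the index was written with in an in-memory reader. */
    public static final class DirectOverrideLookup implements CodecLookupService {
        private final CodecLookupService delegate;
        private final boolean preload;

        public DirectOverrideLookup(CodecLookupService delegate, boolean preload) {
            this.delegate = delegate;
            this.preload = preload;
        }

        @Override
        public PostingsFormatStub lookup(String nameInIndex) {
            PostingsFormatStub onDisk = delegate.lookup(nameInIndex);
            if (!preload) {
                return onDisk; // read straight from the on-disk format
            }
            // Placeholder for "load the on-disk postings into in-memory structures".
            return () -> "InMemory(" + onDisk.name() + ")";
        }
    }

    public static void main(String[] args) {
        RegistryLookupService spiLike = new RegistryLookupService();
        spiLike.register(() -> "OnDiskFormat");
        CodecLookupService lookup = new DirectOverrideLookup(spiLike, true);
        System.out.println(lookup.lookup("OnDiskFormat").name()); // prints "InMemory(OnDiskFormat)"
    }
}
```

The point of the indirection is that the read-time choice lives in the caller's
lookup object rather than in anything recorded at index time.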

But it sounds like the core problem is that when we load a Codec/PF/DVF for
reading, we cannot pass it any arguments, and so we must make an index-time
decision about how we're going to read the data later on. If we could
somehow support that, I think it would help you achieve what you want.

E.g. currently it's an all-or-nothing decision, but if we could pass a
parameter like "50% available heap", the Codec/PF/DVF could cache the
frequently accessed postings instead of loading all of them into memory.
But, that can also be achieved at the IndexReader level, through a custom
FilterAtomicReader. And if you could reuse DPF's structures (like
DirectTermsEnum, DirectFields...), it should be easier to do this. So
perhaps we can think about a DirectAtomicReader which does that? I believe
it can share some code w/ DPF, as long as we don't make these APIs public,
or make them @super.experimental and
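
To make the "50% available heap" parameter concrete, here is a rough sketch of
a byte-budgeted postings cache in plain Java. It is only an assumption about
how such a parameter might be used: BudgetedPostingsCache and its methods are
hypothetical names, postings are modelled as int[] doc-id arrays, and LRU
eviction is used as a simple stand-in for "frequently accessed". It does not
reuse any of DPF's actual structures.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class BudgetedPostingsCache {
    private final long budgetBytes;
    private long usedBytes = 0;
    // accessOrder=true: iteration runs from least to most recently used entry.
    private final LinkedHashMap<String, int[]> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final Function<String, int[]> onDiskReader;

    public BudgetedPostingsCache(long budgetBytes, Function<String, int[]> onDiskReader) {
        this.budgetBytes = budgetBytes;
        this.onDiskReader = onDiskReader;
    }

    /** Turns a fraction such as 0.5 into a byte budget based on the JVM's maximum heap. */
    public static long budgetFromHeap(double fraction) {
        return (long) (Runtime.getRuntime().maxMemory() * fraction);
    }

    /** Returns postings for a term, serving from memory when cached and from "disk" otherwise. */
    public int[] postings(String term) {
        int[] cached = cache.get(term); // a hit also marks the entry as recently used
        if (cached != null) {
            return cached;
        }
        int[] loaded = onDiskReader.apply(term);
        long size = 4L * loaded.length; // approximate: 4 bytes per doc id, headers ignored
        if (size > budgetBytes) {
            return loaded; // larger than the whole budget, never cached
        }
        // Evict least recently used entries until the new postings fit.
        Iterator<Map.Entry<String, int[]>> it = cache.entrySet().iterator();
        while (usedBytes + size > budgetBytes && it.hasNext()) {
            usedBytes -= 4L * it.next().getValue().length;
            it.remove();
        }
        cache.put(term, loaded);
        usedBytes += size;
        return loaded;
    }

    public boolean isCached(String term) {
        return cache.containsKey(term);
    }

    public static void main(String[] args) {
        // Fake "on-disk" reader: every term has two postings (8 bytes); budget holds two terms.
        BudgetedPostingsCache c = new BudgetedPostingsCache(16, term -> new int[2]);
        c.postings("a");
        c.postings("b");
        c.postings("a"); // "a" becomes most recently used
        c.postings("c"); // evicts "b"
        System.out.println(c.isCached("a") + " " + c.isCached("b") + " " + c.isCached("c")); // prints "true false true"
    }
}
```

In a real reader the budget would presumably come from something like
budgetFromHeap(0.5), and the same logic could sit behind a filtering reader
instead of inside the codec.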

Just throwing some ideas...


On Mon, Apr 7, 2014 at 5:35 PM, <> wrote:

> Benson, I like your idea.
> I think your idea can be achieved as a codec, one that wraps another codec
> that establishes the on-disk format.  By default the wrapped codec can be
> Lucene's default codec.  I think, if implemented, this would be a change to
> DPF instead of an additional DPF-variant codec.
> ~ David
> On Mon, Apr 7, 2014 at 9:22 AM, Benson Margulies <> wrote:
>> On Mon, Apr 7, 2014 at 9:14 AM, Robert Muir <> wrote:
>> > On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies <> wrote:
>> >
>> >>
>> >> My takeaway from the prior conversation was that various people didn't
>> >> entirely believe that I'd seen a dramatic improvement in query performance
>> >> using D-P-F, and so would not smile upon a patch intended to liberate
>> >> D-P-F from codecs. It could be that the effect I saw has to do with
>> >> the fact that our system depends on hitting and scoring 50% of the
>> >> documents in an index with a lot of documents.
>> >>
>> >
>> > I don't understand the word "liberate" here. Why is it such a problem
>> > that this is a codec?
>>  I don't want to have to declare my intentions at the time I create
>> the index. I don't want to have to use D-P-F for all readers all the
>> time. Because I want to be able to decide to open up an index with an
>> arbitrary on-disk format and get the in-memory cache behavior of
>> D-P-F. Thus 'liberate' -- split the question of 'keep a copy in
>> memory' from the choice of the on-disk format.
>> >
>> > i do not think we should give it any more status than that, it wastes
>> > too much ram.
>> It didn't seem like 'waste' when it solved a big practical problem for us. We
>> had an application that was too slow, and had plenty of RAM available,
>> and we were able to trade space for time by applying D-P-F.
>> Maybe I'm going about this backwards; if I can come up with a small,
>> inconspicuous proposed change that does what I want, there won't be
>> any disagreement.
