lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Index size increases disproportionately to size of added field when indexed=false
Date Tue, 13 Feb 2018 16:05:52 GMT
David:

Right, Optimize Is Evil. Well, actually in your case it's not. In your
specific case you can optimize every time you build your index and be
OK, gory details here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

But that's just for background. The key is how many deleted docs you
have, which you can see from the admin UI screen. If you have 0
deleted docs, you have 0 space that would be reclaimed by an optimize.
My bet is that you have no deleted docs. If so, just forget the whole
optimize question, as it's a red herring.

"...storage increase would be approximately 200,000 * 19 = 3.8M bytes
= 3.6MB rather than the 7.5GB..."

Actually I'd expect it to be only half that (1.9M). Stored fields are
compressed on disk and we usually see about a 2:1 compression ratio.
There'll be a little bit of fudge for metadata, but probably not
enough to measure.
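
To make that arithmetic concrete, here is a rough sketch in Python. The
200,000 docs and 19 bytes per value come from David's numbers; the 2:1
ratio is only the rule of thumb mentioned above, not a measured value,
and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of the space a small stored field should add.
NUM_DOCS = 200_000        # from the quoted message
BYTES_PER_VALUE = 19      # from the quoted message
COMPRESSION_RATIO = 2.0   # assumption: the "about 2:1" rule of thumb

def estimate_stored_bytes(num_docs, bytes_per_value, compression_ratio=1.0):
    """Return the estimated on-disk bytes for one stored field (hypothetical helper)."""
    return num_docs * bytes_per_value / compression_ratio

raw = estimate_stored_bytes(NUM_DOCS, BYTES_PER_VALUE)
compressed = estimate_stored_bytes(NUM_DOCS, BYTES_PER_VALUE, COMPRESSION_RATIO)

print(f"uncompressed: {raw:,.0f} bytes ({raw / 1024**2:.1f} MiB)")
print(f"compressed:   {compressed:,.0f} bytes ({compressed / 1024**2:.1f} MiB)")
```

Either way the expected growth is a couple of megabytes, which is why a
multi-gigabyte increase points at something other than the stored
values themselves.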

So yes, this is totally weird. I think you'll also find that docValues
is set to true by default. This _still_ shouldn't be adding that much
to this index, but if you turn docValues off for that field what
happens?

Stored data is held in your *.fdt and *.fdx files. What's the total
space used in your index directory by these two extensions with and
without your field?

*.dvd files contain the docValues data. Again, what's the total size
of all these files with and without your field?

These are two specific places to look, but in general I'm asking what
the total size is by extension in your index directory with and
without your field. My guess is that one extension will be massively
bigger. That would be totally surprising, but it'd give us a clue
where to look.
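
One way to collect those per-extension totals is a small script like
the sketch below. It assumes you have both index directories on local
disk (for example copies of data/index from each build); the function
names and the paths in the usage comment are hypothetical. Files with
no extension, such as segments_N, are reported under their full name.

```python
import os
from collections import defaultdict

def size_by_extension(index_dir):
    """Sum file sizes in index_dir (non-recursive), keyed by file extension."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            # Files without an extension (e.g. segments_N) keep their name as key.
            ext = os.path.splitext(entry.name)[1] or entry.name
            totals[ext] += entry.stat().st_size
    return dict(totals)

def compare_indexes(before_dir, after_dir):
    """Return (extension, before, after, delta) rows, largest growth first."""
    before = size_by_extension(before_dir)
    after = size_by_extension(after_dir)
    rows = []
    for ext in set(before) | set(after):
        b = before.get(ext, 0)
        a = after.get(ext, 0)
        rows.append((ext, b, a, a - b))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

def print_report(rows):
    """Print one line per extension with sizes in bytes."""
    for ext, b, a, delta in rows:
        print(f"{ext:<16} {b:>15,} {a:>15,} {delta:>+15,}")

# Usage (paths are placeholders, not taken from this thread):
# print_report(compare_indexes("/tmp/index_without_field", "/tmp/index_with_field"))
```

The extension at the top of the report is the one that grew the most
and is the first place to look.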

Here are the file extensions and what they contain BTW:
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html

Best,
Erick

On Tue, Feb 13, 2018 at 3:41 AM, Alessandro Benedetti
<a.benedetti@sease.io> wrote:
> Hi David,
> given the fact that you are actually building a new index from scratch, my
> shot in the dark didn't hit any target.
> When you say  : "Once the import finishes we save the docker image in the
> AWS docker repository.  We then build our cluster using that image as the
> base"
>
> Do you mean just configuration-wise?
> Will the new cluster have any starting index on disk?
> If I understood your latest statements correctly, I expect a NO here.
>
> So you are building a completely new index and, comparing it to the old
> index (which is completely separate), you see such a big difference in size.
> This is extremely suspicious.
> Optimizing in the end is just a huge merge to force 1 ( or N) final
> segments.
> Given the additional information you gave me, it's not going to make much
> difference.
>
> I would recommend to check how the index space is divided in different file
> formats [1]
> ( i.e. list how much space is dedicated to a specific extension)
>
> Stored content is in the .fdt files.
>
>
> [1]
> https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
