lucene-java-user mailing list archives

From Jerven Tjalling Bolleman <Jerven.Bolle...@sib.swiss>
Subject Re: Static index, fastest way to do forceMerge
Date Fri, 02 Nov 2018 19:30:27 GMT
Hi Dawid, Erick,

Thanks for the reply. We are using pure Lucene, and this is what I am
currently doing:

int processors = Runtime.getRuntime().availableProcessors();

// Allow one merge thread per core and switch off automatic I/O throttling.
ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
cms.setMaxMergesAndThreads(processors, processors);
cms.disableAutoIOThrottle();
config.setMergeScheduler(cms);

// Let each merge take in more segments at once.
TieredMergePolicy tms = new TieredMergePolicy();
int merges = Math.max(2, processors);
tms.setMaxMergeAtOnce(merges);
tms.setMaxMergeAtOnceExplicit(merges);
tms.setSegmentsPerTier(merges * 2);
// Pass the configured policy; a fresh TieredMergePolicy would discard the settings above.
config.setMergePolicy(tms);

IndexWriter writer = new IndexWriter(this.dir, config);
writer.forceMerge(1, true);

Regarding a performance decrease without merging: yes, it is significant.
The one time we tried it we broke every SLA, and we haven't tried again in
a long time. Of course, our largest index is now more than 200 million
documents, so perhaps we should retry this.

   38G _583u.fdt
   25M _583u.fdx
   13K _583u.fnm
   47G _583u_Lucene50_0.doc
   54G _583u_Lucene50_0.pos
   30G _583u_Lucene50_0.tim
  413M _583u_Lucene50_0.tip
  2.1G _583u_Lucene70_0.dvd
   213 _583u_Lucene70_0.dvm

If we wrote such maximum-resource merge code, would there be interest in
having it merged?

Or should we maybe do something like this, assuming 64 CPUs:

writer.forceMerge(64, true);
writer.forceMerge(32, true);
writer.forceMerge(16, true);
writer.forceMerge(8, true);
writer.forceMerge(4, true);
writer.forceMerge(2, true);
writer.forceMerge(1, true);
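
The staged sequence above could be derived from the CPU count instead of being written out by hand. This is only a minimal sketch: `ForceMergeSteps` and `halvingTargets` are hypothetical names, not Lucene API, and whether staging is actually faster than a single forceMerge(1, true) has not been measured here.

```java
import java.util.ArrayList;
import java.util.List;

final class ForceMergeSteps {
    /** Segment-count targets, halving from the largest power of two <= start down to 1. */
    static List<Integer> halvingTargets(int start) {
        if (start < 1) {
            throw new IllegalArgumentException("start must be >= 1");
        }
        List<Integer> targets = new ArrayList<>();
        for (int t = Integer.highestOneBit(start); t >= 1; t /= 2) {
            targets.add(t);
        }
        return targets;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int target : halvingTargets(cpus)) {
            // In the real code this step would be: writer.forceMerge(target, true);
            System.out.println("forceMerge(" + target + ", true)");
        }
    }
}
```

With 64 CPUs this yields the same targets as the list above: 64, 32, 16, 8, 4, 2, 1.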

Regards,
Jerven


On 2018-11-02 19:19, Dawid Weiss wrote:
> We are faced with a similar situation. Yes, the merge process can take
> a long time and is mostly single-threaded (if you're merging from N
> segments into a single segment, only one thread does the job). As
> Erick pointed out, the merge process takes a backseat compared to
> indexing and searches (in most cases), so it's not a priority, but
> it's definitely something people like you (and me) could utilize, if
> given the opportunity.
> 
> I actually don't see any reasons why merging of individual parts of a
> segment can't be done in parallel (this would be a start; later on a
> splittable strategy of merging single things could make use of things
> like the fork-join executor). I'd love to work on this at some point,
> but I honestly don't see any time soon this could be happening. If you
> have a spare cycle, take a look at how index writer merges a single
> segment; there are quite trivial ways this could be split into
> parallel subtasks and executed with, for example, the system fork-join
> executor (even without forkable tasks).
> 
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2999-L3007
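
The idea in the paragraph above, running the independent parts of a single-segment merge as parallel subtasks on the system fork-join executor, might look roughly like the sketch below. It does not touch Lucene at all: `ParallelPartsSketch`, `runAll`, and the part labels are hypothetical stand-ins, and the real per-part merge calls inside IndexWriter are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

final class ParallelPartsSketch {
    /** Runs independent subtasks on the common fork-join pool; results keep task order. */
    static List<String> runAll(List<Callable<String>> parts) {
        try {
            List<Future<String>> futures = ForkJoinPool.commonPool().invokeAll(parts);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // surfaces a failed subtask as ExecutionException
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a merge subtask failed", e);
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> parts = new ArrayList<>();
        // Placeholder labels; each task stands in for merging one part of the segment.
        for (String name : new String[] {"stored fields", "postings", "doc values"}) {
            parts.add(() -> "merged " + name);
        }
        System.out.println(runAll(parts));
    }
}
```

Plain Callables are enough here, which matches the "even without forkable tasks" remark: nothing needs to split recursively for the parts to run side by side.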
> 
> As a side note, you may want to make absolutely sure your merge
> scheduler (if it's the CMS) is not using any I/O throttling -- this is
> theoretically self-adjustable, but in practice if you only care about
> the wall-clock end of a single merge, it's better to turn it off.
> 
> Dawid
> On Fri, Nov 2, 2018 at 5:57 PM Erick Erickson <erickerickson@gmail.com> 
> wrote:
>> 
>> The merge process is rather tricky, and there's nothing that I know of
>> that will use all resources available. In fact the merge code is
>> written to _not_ use up all the possible resources on the theory that
>> there should be some left over to handle queries etc.
>> 
>> Yeah, the situation you describe is indeed one of the few where
>> merging down to 1 segment makes sense. Out of curiosity, what kind of
>> performance gains do you see?
>> 
>> This applies to the default TieredMergePolicy (TMP):
>> 
>> 1> there is a limit to the number of segments that can be merged at
>> once, so sometimes it can take more than one pass. If you have more
>> than 30 segments, it'll be multi-pass. You can try (and I haven't done
>> this personally) setting maxMergeAtOnceExplicit in your solrconfig.xml
>> to see if it helps. That only takes effect when you forceMerge.
>> There's a tricky bit of reflection that handles this, see the very end
>> of TieredMergePolicy.java for the parameters you can set.
>> 
>> 2> As of Solr 7.5 (see LUCENE-7976) the default behavior has changed
>> from automatically merging down to 1 segment to respecting
>> "maxMergedSegmentMB" (default 5G). You will have to explicitly pass
>> maxSegments=1 to get the old behavior.
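
To make point 1 concrete, here is a rough way to estimate how many passes a forced merge down to one segment needs for a given per-merge limit. This is a simplified model and an assumption, not the actual TieredMergePolicy algorithm: it supposes every pass merges groups of up to `maxMergeAtOnce` segments into one segment each. `ForcedMergePasses` and `passes` are hypothetical names.

```java
final class ForcedMergePasses {
    /** Estimated passes to reach one segment when each merge takes at most maxMergeAtOnce segments. */
    static int passes(int segments, int maxMergeAtOnce) {
        if (segments < 1 || maxMergeAtOnce < 2) {
            throw new IllegalArgumentException("need segments >= 1 and maxMergeAtOnce >= 2");
        }
        int passes = 0;
        while (segments > 1) {
            // Ceiling division: number of groups, and so of segments, left after this pass.
            segments = (segments + maxMergeAtOnce - 1) / maxMergeAtOnce;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println(passes(30, 30)); // one pass: everything fits in a single merge
        System.out.println(passes(31, 30)); // two passes: one segment is left over after the first
    }
}
```

Under this model, raising the limit only helps once it reaches the number of segments present; below that, the final pass is still a single merge into one segment.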
>> 
>> Best,
>> Erick
>> On Fri, Nov 2, 2018 at 3:13 AM Jerven Bolleman
>> <jerven.bolleman@sib.swiss> wrote:
>> >
>> > Dear Lucene Devs and Users,
>> >
>> > First of all thank you for this wonderful library and API.
>> >
>> > forceMerges are normally not recommended, but we fall into one of the few
>> > use cases where it makes sense.
>> >
>> > In our use case we have a large index (3 actually) and we don't update
>> > them ever after indexing. i.e. we index all the documents and then never
>> > ever add another document to the index, nor are any deleted.
>> >
>> > It has proven beneficial for search performance to always forceMerge down
>> > to one segment. However, this takes significant time. Are there any
>> > suggestions on what kind of merge scheduler/policy settings will utilize
>> > the most of the available IO, CPU and RAM capacity? Currently we end up
>> > being single thread bound, leaving lots of potential cpu and bandwidth
>> > not used during the merge.
>> >
>> > i.e. we are looking for a "MergeEverything", use-all-hardware policy and
>> > scheduler.
>> >
>> > We are currently on lucene 7.4 but nothing is stopping us from upgrading.
>> >
>> > Regards,
>> > Jerven
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >

-- 
Jerven Tjalling Bolleman
SIB | Swiss Institute of Bioinformatics
CMU - 1, rue Michel Servet - 1211 Geneva 4
t: +41 22 379 58 85 - f: +41 22 379 58 58
Jerven.Bolleman@sib.swiss - http://www.sib.swiss


