lucene-solr-user mailing list archives

From Upayavira ...@odoko.co.uk>
Subject Re: Index optimize runs in background.
Date Thu, 11 Jun 2015 06:52:35 GMT
Until somewhere around Lucene 3.5, you needed to optimise, because the
merge strategy used wasn't that clever and left lots of deletes in your
largest segment. Around that point, the TieredMergePolicy became the
default. Because its algorithm is much more sophisticated, it took away
the need to optimize in the majority of scenarios. In fact, it
transformed optimizing from being a necessary thing to being a "bad"
thing in most cases.

So yes, let the algorithm take care of it, so long as you are using the
TieredMergePolicy, which has been the default for over 2 years.

Upayavira

On Thu, Jun 11, 2015, at 07:01 AM, Walter Underwood wrote:
> Why would you care when the forced merge (not an “optimize”) is done?
> Start it and get back to work.
> 
> Or even better, never force merge and let the algorithm take care of it.
> Seriously, I’ve been giving this advice since before Lucene was written,
> because Ultraseek had the same approach for managing index segments.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> On Jun 10, 2015, at 10:35 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
> 
> > If I knew, I would fix it ;). The sub-optimizes (i.e. the ones
> > sent out to each replica) should be sent in parallel and then
> > each thread should wait for completion from the replicas. There
> > is no real "check for optimize", I believe that the return from the
> > call is considered sufficient. If we can track down if there are
> > conditions under which this is not true we can fix it.
> > 
> > But until there's a way to reproduce it, it's pretty much speculation.
> > 
> > Best,
> > Erick
> > 
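Since the reply above notes that there is no dedicated "check for optimize" call, one possible client-side workaround is to poll the index's segment count until it drops to the requested maximum. What follows is a minimal sketch in plain Java, not part of SolrJ: `OptimizeWaiter`, `waitForSegments`, and the `IntSupplier` argument are hypothetical names, and how the segment count is fetched is left to the caller. One candidate source is the `segmentCount` value in the Luke request handler's index info, but that is an assumption to verify against the Solr version in use, and in SolrCloud it would have to be checked per core.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

// Hypothetical helper, not a Solr or SolrJ API.
final class OptimizeWaiter {

    private OptimizeWaiter() {}

    /**
     * Polls the supplied segment count until it is at or below maxSegments,
     * or until the timeout elapses.
     *
     * @return true if the target was reached, false on timeout or interrupt
     */
    static boolean waitForSegments(IntSupplier segmentCount,
                                   int maxSegments,
                                   Duration pollInterval,
                                   Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (segmentCount.getAsInt() <= maxSegments) {
                return true;
            }
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

An indexer could call this after issuing the optimize, for example `OptimizeWaiter.waitForSegments(counter, 1, Duration.ofSeconds(30), Duration.ofHours(3))`, where `counter` is whatever the caller uses to read the segment count. Note that the check is only meaningful if the index had more than `maxSegments` segments before the optimize started; otherwise it returns immediately.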
> > On Wed, Jun 10, 2015 at 10:14 PM, Modassar Ather <modather1981@gmail.com>
> > wrote:
> >> Hi,
> >> 
> >> There are 5 cores and a separate server for indexing on this solrcloud. Can
> >> you please share your suggestions on:
> >>  How can indexer know that the optimize has completed even if the
> >> commit/optimize runs in background without going to the solr servers may be
> >> by using any solrj or other API?
> >> 
> >> I tried but could not find any API/handler to check if the optimizations is
> >> completed. Kindly share your inputs.
> >> 
> >> Thanks,
> >> Modassar
> >> 
> >> On Thu, Jun 4, 2015 at 9:36 PM, Erick Erickson <erickerickson@gmail.com>
> >> wrote:
> >> 
> >>> Can't get any failures to happen on my end so I really haven't a clue.
> >>> 
> >>> Best,
> >>> Erick
> >>> 
> >>> On Thu, Jun 4, 2015 at 3:17 AM, Modassar Ather <modather1981@gmail.com>
> >>> wrote:
> >>>> Hi,
> >>>> 
> >>>> Please provide your inputs on optimize and commit running as background.
> >>>> Your suggestion will be really helpful.
> >>>> 
> >>>> Thanks,
> >>>> Modassar
> >>>> 
> >>>> On Tue, Jun 2, 2015 at 6:05 PM, Modassar Ather <modather1981@gmail.com>
> >>>> wrote:
> >>>> 
> >>>>> Erick! I could not find any underlying setting of 10 minutes.
> >>>>> It is not only optimize but commit is also behaving in the same fashion
> >>>>> and is taking lesser time than usually had taken.
> >>>>> As per my observation both are running in background.
> >>>>> 
> >>>>> On Fri, May 29, 2015 at 7:21 PM, Erick Erickson <erickerickson@gmail.com>
> >>>>> wrote:
> >>>>> 
> >>>>>> I'm not talking about you setting a timeout, but the underlying
> >>>>>> connection timing out...
> >>>>>> 
> >>>>>> The "10 minutes then the indexer exits" comment points in that direction.
> >>>>>> 
> >>>>>> Best,
> >>>>>> Erick
> >>>>>> 
> >>>>>> On Thu, May 28, 2015 at 11:43 PM, Modassar Ather <modather1981@gmail.com>
> >>>>>> wrote:
> >>>>>>> I have not added any timeout in the indexer except zk client time out
> >>>>>>> which is 30 seconds. I am simply calling client.close() at the end of
> >>>>>>> indexing.
> >>>>>>> The same code was not running in background for optimize with
> >>>>>>> solr-4.10.3 and org.apache.solr.client.solrj.impl.CloudSolrServer.
> >>>>>>> 
> >>>>>>> On Fri, May 29, 2015 at 11:13 AM, Erick Erickson <erickerickson@gmail.com>
> >>>>>>> wrote:
> >>>>>>> 
> >>>>>>>> Are you timing out on the client request? The theory here is that
> >>>>>>>> it's still a synchronous call, but you're just timing out at the
> >>>>>>>> client level. At that point, the optimize is still running it's just
> >>>>>>>> the connection has been dropped....
> >>>>>>>> 
> >>>>>>>> Shot in the dark.
> >>>>>>>> Erick
> >>>>>>>> 
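The theory above can be illustrated without Solr at all. The following is a minimal sketch in plain Java with hypothetical names (`ClientTimeoutSketch`, `serverFinishesAfterClientTimeout`): a worker thread stands in for the server-side optimize, and a bounded `Future.get` stands in for the client connection timing out. It only models the idea that a client giving up does not stop the work; it makes no claim about what SolrJ or Solr actually do internally.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of "client times out, server keeps working".
final class ClientTimeoutSketch {

    private ClientTimeoutSketch() {}

    /**
     * Returns true only if the simulated client timed out AND the simulated
     * server-side work still ran to completion afterwards.
     */
    static boolean serverFinishesAfterClientTimeout(long serverWorkMillis,
                                                    long clientTimeoutMillis) {
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the long-running optimize on the server.
            Future<Boolean> optimize = server.submit(() -> {
                Thread.sleep(serverWorkMillis);
                return true;
            });
            try {
                // Stand-in for the client waiting on the open connection.
                optimize.get(clientTimeoutMillis, TimeUnit.MILLISECONDS);
                return false; // the call returned in time; no timeout occurred
            } catch (TimeoutException e) {
                // The client gives up here, but the task is not cancelled.
            }
            // Observe that the work completes even though the client stopped waiting.
            return optimize.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            server.shutdown();
        }
    }
}
```

From the client's side, the timed-out case looks as if the call "went into the background", which matches the symptom described in this thread, although the thread never confirms that this is the actual cause.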
> >>>>>>>> On Thu, May 28, 2015 at 10:31 PM, Modassar Ather <modather1981@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> I could not notice it but with my past experience of commit which
> >>>>>>>>> used to take around 2 minutes is now taking around 8 seconds. I think
> >>>>>>>>> this is also running as background.
> >>>>>>>>> 
> >>>>>>>>> On Fri, May 29, 2015 at 10:52 AM, Modassar Ather <modather1981@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>> 
> >>>>>>>>>> The indexer takes almost 2 hours to optimize. It has a multi-threaded
> >>>>>>>>>> add of batches of documents to
> >>>>>>>>>> org.apache.solr.client.solrj.impl.CloudSolrClient.
> >>>>>>>>>> Once all the documents are indexed it invokes commit and optimize. I
> >>>>>>>>>> have seen that the optimize goes into background after 10 minutes and
> >>>>>>>>>> indexer exits.
> >>>>>>>>>> I am not sure why this 10 minutes it hangs on indexer. This behavior I
> >>>>>>>>>> have seen in multiple iteration of the indexing of same data.
> >>>>>>>>>> 
> >>>>>>>>>> There is nothing significant I found in log which I can share. I can
> >>>>>>>>>> see following in log.
> >>>>>>>>>> org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> >>>>>>>>>> 
> >>>>>>>>>> On Wed, May 27, 2015 at 10:59 PM, Erick Erickson <erickerickson@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> All strange of course. What do your Solr logs show when this happens?
> >>>>>>>>>>> And how reproducible is this?
> >>>>>>>>>>> 
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Erick
> >>>>>>>>>>> 
> >>>>>>>>>>> On Wed, May 27, 2015 at 4:00 AM, Upayavira <uv@odoko.co.uk> wrote:
> >>>>>>>>>>>> In this case, optimising makes sense, once the index is generated,
> >>>>>>>>>>>> you are not updating it.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Upayavira
> >>>>>>>>>>>> 
> >>>>>>>>>>>> On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote:
> >>>>>>>>>>>>> Our index has almost 100M documents running on SolrCloud of 5 shards
> >>>>>>>>>>>>> and each shard has an index size of about 170+GB (for the record, we
> >>>>>>>>>>>>> are not using stored fields - our documents are pretty large). We
> >>>>>>>>>>>>> perform a full indexing every weekend and during the week there are
> >>>>>>>>>>>>> no updates made to the index. Most of the queries that we run are
> >>>>>>>>>>>>> pretty complex with hundreds of terms using PhraseQuery,
> >>>>>>>>>>>>> BooleanQuery, SpanQuery, Wildcards, boosts etc. and take many
> >>>>>>>>>>>>> minutes to execute. A difference of 10-20% is also a big advantage
> >>>>>>>>>>>>> for us.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> We have been optimizing the index after indexing for years and it
> >>>>>>>>>>>>> has worked well for us. Every once in a while, we upgrade Solr to
> >>>>>>>>>>>>> the latest version and try without optimizing so that we can save
> >>>>>>>>>>>>> the many hours it take to optimize such a huge index, but find
> >>>>>>>>>>>>> optimized index work well for us.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Erick I was indexing today the documents and saw the optimize
> >>>>>>>>>>>>> happening in background.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On Tue, May 26, 2015 at 9:12 PM, Erick Erickson <erickerickson@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> No results yet. I finished the test harness last night (not really
> >>>>>>>>>>>>>> a unit test, a stand-alone program that endlessly adds stuff and
> >>>>>>>>>>>>>> tests that every commit returns the correct number of docs).
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 8,000 cycles later there aren't any problems reported.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Siiigggggh.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Tue, May 26, 2015 at 1:51 AM, Modassar Ather <modather1981@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Erick you mentioned about a unit test to test the optimize running
> >>>>>>>>>>>>>>> in background. Kindly share your findings if any.
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Modassar
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>> On Mon, May 25, 2015 at 11:47 AM, Modassar Ather <modather1981@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thanks everybody for your replies.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> I have noticed the optimization running in background every time I
> >>>>>>>>>>>>>>>> indexed. This is 5 node cluster with solr-5.1.0 and uses the
> >>>>>>>>>>>>>>>> CloudSolrClient. Kindly share your findings on this issue.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Our index has almost 100M documents running on SolrCloud. We have
> >>>>>>>>>>>>>>>> been optimizing the index after indexing for years and it has
> >>>>>>>>>>>>>>>> worked well for us.
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Modassar
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> On Fri, May 22, 2015 at 11:55 PM, Erick Erickson <erickerickson@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Actually, I've recently seen very similar behavior in Solr 4.10.3,
> >>>>>>>>>>>>>>>>> but involving hard commits openSearcher=true, see:
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-7572. Of course I can't
> >>>>>>>>>>>>>>>>> reproduce this at will, siigggghhhh.
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> A unit test should be very simple to write though, maybe I can get
> >>>>>>>>>>>>>>>>> to it today.
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> Erick
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>> On Fri, May 22, 2015 at 8:27 AM, Upayavira <uv@odoko.co.uk> wrote:
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> On Fri, May 22, 2015, at 03:55 PM, Shawn Heisey wrote:
> >>>>>>>>>>>>>>>>>>> On 5/21/2015 6:21 AM, Modassar Ather wrote:
> >>>>>>>>>>>>>>>>>>>> I am using Solr-5.1.0. I have an indexer class which invokes
> >>>>>>>>>>>>>>>>>>>> cloudSolrClient.optimize(true, true, 1). My indexer exits after
> >>>>>>>>>>>>>>>>>>>> the invocation of optimize and the optimization keeps on running
> >>>>>>>>>>>>>>>>>>>> in the background.
> >>>>>>>>>>>>>>>>>>>> Kindly let me know if it is per design and how can I make my
> >>>>>>>>>>>>>>>>>>>> indexer to wait until the optimization is over. Is there a
> >>>>>>>>>>>>>>>>>>>> configuration/parameter I need to set for the same.
> >>>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>>> Please note that the same indexer with
> >>>>>>>>>>>>>>>>>>>> cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to
> >>>>>>>>>>>>>>>>>>>> wait till the optimize was over before exiting.
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> This is very odd, because I could not get HttpSolrServer to
> >>>>>>>>>>>>>>>>>>> optimize in the background, even when that was what I wanted.
> >>>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>>> I wondered if maybe the Cloud object behaves differently with
> >>>>>>>>>>>>>>>>>>> regard to blocking until an optimize is finished ... except that
> >>>>>>>>>>>>>>>>>>> there is no code for optimizing in CloudSolrClient at all ... so I
> >>>>>>>>>>>>>>>>>>> don't know where the different behavior would actually be
> >>>>>>>>>>>>>>>>>>> happening.
> >>>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>>>> A more important question is, why are you optimising? Generally it
> >>>>>>>>>>>>>>>>>> isn't recommended anymore as it reduces the natural distribution of
> >>>>>>>>>>>>>>>>>> documents amongst segments and makes future merges more costly.
> >>>>>>>>>>>>>>>>>> Upayavira
> >>>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>> 
> >>>>>> 
> >>>>> 
> >>>>> 
> >>> 
> 
