lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: 7.2.1 cluster dies within minutes after restart
Date Fri, 02 Feb 2018 12:27:00 GMT
Hello Ere,

It appears that my initial e-mail [1] got lost in the thread. We don't have GC issues; the
cluster that dies occasionally runs, in general, smoothly and quickly with just 2 GB allocated.

Thanks,
Markus

[1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html

-----Original message-----
> From:Ere Maijala <ere.maijala@helsinki.fi>
> Sent: Friday 2nd February 2018 8:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> Markus,
> 
> I may be stating the obvious here, but I didn't notice garbage 
> collection mentioned in any of the previous messages, so here goes. In 
> our experience almost all of the ZooKeeper timeouts etc. have been 
> caused by overly long garbage collection pauses. I've summed up my 
> observations here: 
> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
> 
> So, in my experience it's relatively easy to cause heavy memory usage 
> in SolrCloud with seemingly innocent queries, and GC can become a 
> problem really quickly even if everything seems to be running smoothly 
> otherwise.
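> 
> One way to check whether pauses are the culprit (a rough sketch; the
> GC_LOG_OPTS variable, flags and log path below are illustrative for a
> Solr 7 / Java 8 install, adjust to your environment) is to enable GC
> logging in solr.in.sh and look for "application threads were stopped"
> entries approaching the ZooKeeper session timeout:
> 
>   # solr.in.sh -- detailed GC logging (Java 8 style flags)
>   GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>     -XX:+PrintGCApplicationStoppedTime"
> 
>   # inspect the most recent stop-the-world pauses
>   grep "application threads were stopped" /var/solr/logs/solr_gc.log | tail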
> 
> Regards,
> Ere
> 
> Markus Jelsma wrote on 31.1.2018 at 23.56:
> > Hello S.G.
> > 
> > We do not complain about speed improvements at all; it is clear 7.x is faster
> > than its predecessor. The problem is stability and not recovering from weird
> > circumstances. In general, it is our high-load cluster containing user
> > interaction logs that suffers the most. Our main text search cluster -
> > receiving far fewer queries - seems mostly unaffected, except last Sunday.
> > After a very short but high burst of queries it entered the same catatonic
> > state the logs cluster usually dies from.
> > 
> > The query burst immediately caused ZK timeouts and high heap consumption (not
> > sure which of the two came first). The query burst lasted for 30 minutes; the
> > excessive heap consumption continued for more than 8 hours before Solr finally
> > realized it could relax. Most remarkably, Solr recovered on its own: ZK
> > timeouts stopped and heap went back to normal.
> > 
> > There seems to be a causal link between high load and this state.
> > 
> > We really want to get this fixed for ourselves and everyone else who may
> > encounter this problem, but I don't know how, so I need much more feedback and
> > hints from those who have a deep understanding of the inner workings of
> > SolrCloud and the changes since 6.x.
> > 
> > To be clear, we don't have the problem of a 15-second ZK timeout; we use 30.
> > Is 30 still too low? Is it even remotely related to this problem? What does
> > load have to do with it?
> > 
> > We are not able to reproduce it in lab environments. It can occur within
> > minutes of cluster startup, but it can also take days.
> > 
> > I've been slightly annoyed by problems that can occur within such a broad time
> > span; it is always bad luck for reproduction.
> > 
> > Any help getting further is much appreciated.
> > 
> > Many thanks,
> > Markus
> >   
> > -----Original message-----
> >> From:S G <sg.online.email@gmail.com>
> >> Sent: Wednesday 31st January 2018 21:48
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>
> >> We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
> >> And that came out all right.
> >> We saw a performance improvement of about 30% in read latencies between 6.6.0
> >> and 7.1.0, and then a performance degradation of about 10% between 7.1.0 and
> >> 7.2.1 in many metrics.
> >> But overall, it still seems better than 6.6.0.
> >>
> >> I will check for the errors in the logs too, but the nodes were responsive
> >> for all of the 23+ hours we ran the load test.
> >>
> >> Disclaimer: we do not test facets, pivots, or block-joins. We will add
> >> those features to our load-testing tool sometime this year.
> >>
> >> Thanks
> >> SG
> >>
> >>
> >> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <markus.jelsma@openindex.io>
> >> wrote:
> >>
> >>> Ah, thanks, I just submitted a patch fixing it.
> >>>
> >>> Anyway, in the end it appears this is not the problem we are seeing, as our
> >>> timeouts were already at 30 seconds.
> >>>
> >>> All I know is that at some point nodes start to lose ZK connections due to
> >>> timeouts (logs say so, but all within 30 seconds); the logs are flooded
> >>> with these messages:
> >>> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> >>> 10359ms for sessionid 0x160f9e723c12122
> >>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> >>> 0x60f9e7234f05bb has expired
> >>>
> >>> Then heap usage doubles and nodes become unresponsive, die, etc.
> >>>
> >>> We also see those messages in other collections, but not so frequently and
> >>> they don't cause failure in those less loaded clusters.
> >>>
> >>> Ideas?
> >>>
> >>> Thanks,
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From:Michael Braun <n3ca88@gmail.com>
> >>>> Sent: Monday 29th January 2018 21:09
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>
> >>>> Believe this is reported in https://issues.apache.org/jira/browse/SOLR-10471
> >>>>
> >>>>
> >>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> >>> markus.jelsma@openindex.io>
> >>>> wrote:
> >>>>
> >>>>> Hello SG,
> >>>>>
> >>>>> The default in solr.in.sh is commented out, so it defaults to the value
> >>>>> set in bin/solr, which is fifteen seconds. Just uncomment the setting in
> >>>>> solr.in.sh and your timeout will be thirty seconds.
> >>>>>
> >>>>> For Solr itself to really default to thirty seconds, Solr's bin/solr needs
> >>>>> to be patched to use the correct value.
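> >>>>>
> >>>>> Concretely, that amounts to something like this in solr.in.sh (value and
> >>>>> comment illustrative; bin/solr then passes it on to Solr as the
> >>>>> zkClientTimeout system property):
> >>>>>
> >>>>>   # uncommented so it overrides the 15000 fallback in bin/solr
> >>>>>   ZK_CLIENT_TIMEOUT="30000"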
> >>>>>
> >>>>> Regards,
> >>>>> Markus
> >>>>>
> >>>>> -----Original message-----
> >>>>>> From:S G <sg.online.email@gmail.com>
> >>>>>> Sent: Monday 29th January 2018 20:15
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>
> >>>>>> Hi Markus,
> >>>>>>
> >>>>>> We are in the process of upgrading our clusters to 7.2.1 and I am not sure
> >>>>>> I quite follow the conversation here.
> >>>>>> Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a higher value
> >>>>>> in the config (and it's just a default value being wrong/overridden
> >>>>>> somewhere)?
> >>>>>> Or is it more severe in the sense that any config set for ZK_CLIENT_TIMEOUT
> >>>>>> by the user is just ignored completely by Solr in 7.2.1?
> >>>>>>
> >>>>>> Thanks
> >>>>>> SG
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma <
> >>>>> markus.jelsma@openindex.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Ok, I applied the patch and it is clear the timeout is 15000. Solr.xml
> >>>>>>> says 30000 if ZK_CLIENT_TIMEOUT is not set, which is by default unset in
> >>>>>>> solr.in.sh, but set in bin/solr to 15000. So it seems Solr's default is
> >>>>>>> still 15000, not 30000.
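> >>>>>>>
> >>>>>>> For reference, the fallback chain (snippets from a stock 7.x install,
> >>>>>>> may differ per version): solr.xml only uses its 30000 default when the
> >>>>>>> zkClientTimeout system property is missing entirely,
> >>>>>>>
> >>>>>>>   <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
> >>>>>>>
> >>>>>>> but since bin/solr fills ZK_CLIENT_TIMEOUT with 15000 whenever
> >>>>>>> solr.in.sh leaves it commented, that fallback never kicks in.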
> >>>>>>>
> >>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh to 30000.
> >>>>>>> To be sure, I applied your patch to a production machine; all our
> >>>>>>> collections run with 30000. So how would that explain this log line?
> >>>>>>>
> >>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> >>>>>>> 22130ms
> >>>>>>>
> >>>>>>> We also see these with smaller values, e.g. seven seconds. And is this
> >>>>>>> actually an indicator of the problems we have?
> >>>>>>>
> >>>>>>> Any ideas?
> >>>>>>>
> >>>>>>> Many thanks,
> >>>>>>> Markus
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original message-----
> >>>>>>>> From:Markus Jelsma <markus.jelsma@openindex.io>
> >>>>>>>> Sent: Saturday 27th January 2018 10:03
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I grepped for it yesterday and found nothing but 30000 in the settings,
> >>>>>>>> but judging from the weird timeout value, you may be right. Let me apply
> >>>>>>>> your patch early next week and check for spurious warnings.
> >>>>>>>>
> >>>>>>>> Another noteworthy observation for those working on cloud stability and
> >>>>>>>> recovery: whenever this happens, some nodes are also absolutely sure to
> >>>>>>>> run OOM. The leaders usually live longest; the replicas don't, their heap
> >>>>>>>> usage peaks every time, consistently.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Markus
> >>>>>>>>
> >>>>>>>> -----Original message-----
> >>>>>>>>> From:Shawn Heisey <apache@elyograg.org>
> >>>>>>>>> Sent: Saturday 27th January 2018 0:49
> >>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>
> >>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> >>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from server
> >>>>>>>>>> in 22130ms (although zkClientTimeOut is 30000).
> >>>>>>>>>
> >>>>>>>>> Are you absolutely certain that there is a setting for zkClientTimeout
> >>>>>>>>> that is actually getting applied?  The default value in Solr's example
> >>>>>>>>> configs is 30 seconds, but the internal default in the code (when no
> >>>>>>>>> configuration is found) is still 15.  I have confirmed this in the code.
> >>>>>>>>>
> >>>>>>>>> Looks like SolrCloud doesn't log the values it's using for things like
> >>>>>>>>> zkClientTimeout.  I think it should.
> >>>>>>>>>
> >>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Shawn
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
> 
