lucene-solr-user mailing list archives

From S G <sg.online.em...@gmail.com>
Subject Re: 7.2.1 cluster dies within minutes after restart
Date Fri, 02 Feb 2018 21:19:17 GMT
Our 3.4.6 ZK nodes were unable to rejoin the cluster unless the quorum was
broken first.
So if a 5-node ZooKeeper ensemble lost 2 nodes, those 2 would not rejoin
because ZK still had its quorum.
To make them join, you had to break the quorum by restarting one of the nodes
still in the quorum.
Only when the quorum broke did ZK realize that something was wrong and
recognize the other two nodes trying to rejoin.
Also, this problem happened only when ZK had been running for a long time,
like several weeks (perhaps DNS caching or something, not sure really).
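
A rough sketch of that kind of check-and-restart cycle, for anyone who wants
to try the same thing (hostnames, ports and paths below are placeholders, not
taken from this thread; "srvr" is one of ZooKeeper's four-letter admin
commands and reports each server's mode):

  # Ask every ensemble member whether it is actually serving and in what mode
  for h in zk1 zk2 zk3 zk4 zk5; do
    echo -n "$h: "
    echo srvr | nc -w 2 "$h" 2181 | grep Mode || echo "not serving"
  done

  # Restarting one member that is still in the quorum forces a new election,
  # after which the stragglers can be picked up again
  ssh zk3 '/opt/zookeeper/bin/zkServer.sh restart'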


On Fri, Feb 2, 2018 at 11:32 AM, Tomas Fernandez Lobbe <tflobbe@apple.com>
wrote:

> Hi Markus,
> If the same code that runs OK in 7.1 breaks in 7.2.1, it is clear to me that
> there is some bug in Solr introduced between those releases (maybe an
> increase in memory utilization? or maybe some decrease in query throughput
> making threads pile up?). I’d hate to have this issue lost in the users
> list, could you create a Jira? Maybe next time you have this issue you can
> post thread/heap dumps, that would be useful.
>
> Tomás
>
> > On Feb 2, 2018, at 9:38 AM, Walter Underwood <wunder@wunderwood.org>
> wrote:
> >
> > Zookeeper 3.4.6 is not good? That was the version recommended by Solr
> docs when I installed 6.2.0.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Feb 2, 2018, at 9:30 AM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
> >>
> >> Hello S.G.
> >>
> >> We have relied on Trie* fields ever since they became available; I don't
> >> think reverting to the old fieldTypes will do us any good, we have a very
> >> recent problem.
> >>
> >> Regarding our heap, the cluster ran fine for years with just 1.5 GB, we
> >> only recently increased it because our data keeps on growing. Heap rarely
> >> goes higher than 50 %, except when this specific problem occurs. The
> >> nodes have no problem processing a few hundred QPS continuously and can
> >> go on for days, sometimes even a few weeks.
> >>
> >> I will keep my eye open for other clues when the problem strikes again!
> >>
> >> Thanks,
> >> Markus
> >>
> >> -----Original message-----
> >>> From:S G <sg.online.email@gmail.com>
> >>> Sent: Friday 2nd February 2018 18:20
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>
> >>> Yeah, definitely check the zookeeper version.
> >>> 3.4.6 is not a good one, I know, and you can say the same for all the
> >>> versions below it too.
> >>> We have used 3.4.9 with no issues,
> >>> while Solr 7.x uses 3.4.10.
> >>>
> >>> Another dimension could be the use (or dis-use) of p-fields like pint,
> >>> plong etc.
> >>> If you are using them, try reverting to tint, tlong etc.
> >>> And if you are not using them, try using them (although doing this means
> >>> a change from your older config and is less likely to help).
> >>>
> >>> Lastly, did I read 2 GB for JVM heap?
> >>> That seems really low to me for any version of Solr.
> >>> We run with 10-16 GB of heap with the G1GC collector and new-gen capped
> >>> at 3-4 GB.
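
(For reference, settings along the lines described above would live in
solr.in.sh; the exact values and flags here are illustrative, not taken from
this thread:)

  SOLR_HEAP="12g"
  # Pinning the young generation limits G1's self-tuning, but matches the
  # "new-gen capped" setup described above
  GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250 -XX:NewSize=3g -XX:MaxNewSize=4g"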
> >>>
> >>>
> >>> On Fri, Feb 2, 2018 at 4:27 AM, Markus Jelsma <
> markus.jelsma@openindex.io>
> >>> wrote:
> >>>
> >>>> Hello Ere,
> >>>>
> >>>> It appears that my initial e-mail [1] got lost in the thread. We don't
> >>>> have GC issues; the cluster that occasionally dies runs, in general,
> >>>> smoothly and quickly with just 2 GB allocated.
> >>>>
> >>>> Thanks,
> >>>> Markus
> >>>>
> >>>> [1]: http://lucene.472066.n3.nabble.com/7-2-1-cluster-dies-within-minutes-after-restart-td4372615.html
> >>>>
> >>>> -----Original message-----
> >>>>> From:Ere Maijala <ere.maijala@helsinki.fi>
> >>>>> Sent: Friday 2nd February 2018 8:49
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>
> >>>>> Markus,
> >>>>>
> >>>>> I may be stating the obvious here, but I didn't notice garbage
> >>>>> collection mentioned in any of the previous messages, so here goes. In
> >>>>> our experience almost all of the Zookeeper timeouts etc. have been
> >>>>> caused by too long garbage collection pauses. I've summed up my
> >>>>> observations here:
> >>>>> <https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html>
> >>>>>
> >>>>> So, in my experience it's relatively easy to cause heavy memory usage
> >>>>> with SolrCloud with seemingly innocent queries, and GC can become a
> >>>>> problem really quickly even if everything seems to be running smoothly
> >>>>> otherwise.
> >>>>>
> >>>>> Regards,
> >>>>> Ere
> >>>>>
> >>>>> Markus Jelsma wrote on 31.1.2018 at 23.56:
> >>>>>> Hello S.G.
> >>>>>>
> >>>>>> We do not complain about speed improvements at all, it is clear 7.x
> >>>>>> is faster than its predecessor. The problem is stability and not
> >>>>>> recovering from weird circumstances. In general, it is our high-load
> >>>>>> cluster containing user interaction logs that suffers the most. Our
> >>>>>> main text search cluster - receiving far fewer queries - seems mostly
> >>>>>> unaffected, except last Sunday. After a very short but high burst of
> >>>>>> queries it entered the same catatonic state the logs cluster usually
> >>>>>> dies from.
> >>>>>>
> >>>>>> The query burst immediately caused ZK timeouts and high heap
> >>>>>> consumption (not sure which of the two came first). The query burst
> >>>>>> lasted for 30 minutes, but the excessive heap consumption continued
> >>>>>> for more than 8 hours before Solr finally realized it could relax.
> >>>>>> Most remarkable was that Solr recovered on its own, ZK timeouts
> >>>>>> stopped, and heap went back to normal.
> >>>>>>
> >>>>>> There seems to be a causality between high load and this state.
> >>>>>>
> >>>>>> We really want to get this fixed for ourselves and everyone else that
> >>>>>> may encounter this problem, but I don't know how, so I need much more
> >>>>>> feedback and hints from those who have a deep understanding of the
> >>>>>> inner workings of SolrCloud and the changes since 6.x.
> >>>>>>
> >>>>>> To be clear, we don't have the problem of a 15 second ZK timeout, we
> >>>>>> use 30. Is 30 still too low? Is it even remotely related to this
> >>>>>> problem? What does load have to do with it?
> >>>>>>
> >>>>>> We are not able to reproduce it in lab environments. It can take
> >>>>>> minutes after cluster startup for it to occur, but also days.
> >>>>>>
> >>>>>> I've been slightly annoyed by problems that can occur over a broad
> >>>>>> time span; it is always bad luck for reproduction.
> >>>>>>
> >>>>>> Any help getting further is much appreciated.
> >>>>>>
> >>>>>> Many thanks,
> >>>>>> Markus
> >>>>>>
> >>>>>> -----Original message-----
> >>>>>>> From:S G <sg.online.email@gmail.com>
> >>>>>>> Sent: Wednesday 31st January 2018 21:48
> >>>>>>> To: solr-user@lucene.apache.org
> >>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>
> >>>>>>> We did some basic load testing on our 7.1.0 and 7.2.1 clusters,
> >>>>>>> and that came out all right.
> >>>>>>> We saw a performance increase of about 30% in read latencies between
> >>>>>>> 6.6.0 and 7.1.0.
> >>>>>>> And then we saw a performance degradation of about 10% between 7.1.0
> >>>>>>> and 7.2.1 in many metrics.
> >>>>>>> But overall, it still seems better than 6.6.0.
> >>>>>>>
> >>>>>>> I will check for the errors in the logs too, but the nodes were
> >>>>>>> responsive for all the 23+ hours we ran the load test.
> >>>>>>>
> >>>>>>> Disclaimer: We do not test facets, pivots, or block-joins, and will
> >>>>>>> add those features to our load-testing tool sometime this year.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> SG
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jan 31, 2018 at 3:12 AM, Markus Jelsma <
> >>>> markus.jelsma@openindex.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Ah thanks, I just submitted a patch fixing it.
> >>>>>>>>
> >>>>>>>> Anyway, in the end it appears this is not the problem we are
> >>>>>>>> seeing, as our timeouts were already at 30 seconds.
> >>>>>>>>
> >>>>>>>> All I know is that at some point nodes start to lose ZK connections
> >>>>>>>> due to timeouts (the logs say so, but all within 30 seconds), and
> >>>>>>>> the logs are flooded with these messages:
> >>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from
> >>>>>>>> server in 10359ms for sessionid 0x160f9e723c12122
> >>>>>>>> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> >>>>>>>> 0x60f9e7234f05bb has expired
> >>>>>>>>
> >>>>>>>> Then there is a doubling in heap usage and nodes become
> >>>>>>>> unresponsive, die, etc.
> >>>>>>>>
> >>>>>>>> We also see those messages in other collections, but not so
> >>>>>>>> frequently, and they don't cause failures in those less loaded
> >>>>>>>> clusters.
> >>>>>>>>
> >>>>>>>> Ideas?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Markus
> >>>>>>>>
> >>>>>>>> -----Original message-----
> >>>>>>>>> From:Michael Braun <n3ca88@gmail.com>
> >>>>>>>>> Sent: Monday 29th January 2018 21:09
> >>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>
> >>>>>>>>> Believe this is reported in
> >>>>>>>>> https://issues.apache.org/jira/browse/SOLR-10471
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Jan 29, 2018 at 2:55 PM, Markus Jelsma <
> >>>>>>>> markus.jelsma@openindex.io>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello SG,
> >>>>>>>>>>
> >>>>>>>>>> The default in solr.in.sh is commented out, so it defaults to the
> >>>>>>>>>> value set in bin/solr, which is fifteen seconds. Just uncomment
> >>>>>>>>>> the setting in solr.in.sh and your timeout will be thirty seconds.
> >>>>>>>>>>
> >>>>>>>>>> For Solr itself to really default to thirty seconds, Solr's
> >>>>>>>>>> bin/solr needs to be patched to use the correct value.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Markus
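
(For reference, the relevant lines in a 7.x solr.in.sh look roughly like the
sketch below; the exact comment text may differ between releases, so treat
this as illustrative rather than authoritative:)

  # Shipped commented out, so bin/solr's built-in 15 second default wins:
  #ZK_CLIENT_TIMEOUT="15000"

  # Uncommenting and raising it makes the override explicit:
  ZK_CLIENT_TIMEOUT="30000"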
> >>>>>>>>>>
> >>>>>>>>>> -----Original message-----
> >>>>>>>>>>> From:S G <sg.online.email@gmail.com>
> >>>>>>>>>>> Sent: Monday 29th January 2018 20:15
> >>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Markus,
> >>>>>>>>>>>
> >>>>>>>>>>> We are in the process of upgrading our clusters to 7.2.1 and I
> >>>>>>>>>>> am not sure I quite follow the conversation here.
> >>>>>>>>>>> Is there a simple workaround to set the ZK_CLIENT_TIMEOUT to a
> >>>>>>>>>>> higher value in the config (and it's just a default value being
> >>>>>>>>>>> wrong/overridden somewhere)?
> >>>>>>>>>>> Or is it more severe in the sense that any config set for
> >>>>>>>>>>> ZK_CLIENT_TIMEOUT by the user is just ignored completely by Solr
> >>>>>>>>>>> in 7.2.1?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> SG
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Jan 29, 2018 at 3:09 AM, Markus Jelsma
> >>>>>>>>>>> <markus.jelsma@openindex.io> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Ok, I applied the patch and it is clear the timeout is 15000.
> >>>>>>>>>>>> Solr.xml says 30000 if ZK_CLIENT_TIMEOUT is not set, which is
> >>>>>>>>>>>> by default unset in solr.in.sh, but set in bin/solr to 15000.
> >>>>>>>>>>>> So it seems Solr's default is still 15000, not 30000.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But, back to my topic. I see we explicitly set it in solr.in.sh
> >>>>>>>>>>>> to 30000. To be sure, I applied your patch to a production
> >>>>>>>>>>>> machine; all our collections run with 30000. So how would that
> >>>>>>>>>>>> explain this log line?
> >>>>>>>>>>>>
> >>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard from
> >>>>>>>>>>>> server in 22130ms
> >>>>>>>>>>>>
> >>>>>>>>>>>> We also see these with smaller values, seven seconds. And, is
> >>>>>>>>>>>> this actually an indicator of the problems we have?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any ideas?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Many thanks,
> >>>>>>>>>>>> Markus
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>> From:Markus Jelsma <markus.jelsma@openindex.io>
> >>>>>>>>>>>>> Sent: Saturday 27th January 2018 10:03
> >>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>> Subject: RE: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I grepped for it yesterday and found nothing but 30000 in the
> >>>>>>>>>>>>> settings, but judging from the weird timeout value, you may be
> >>>>>>>>>>>>> right. Let me apply your patch early next week and check for
> >>>>>>>>>>>>> spurious warnings.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Another noteworthy observation for those working on cloud
> >>>>>>>>>>>>> stability and recovery: whenever this happens, some nodes are
> >>>>>>>>>>>>> also absolutely sure to run OOM. The leaders usually live
> >>>>>>>>>>>>> longest, the replicas don't; their heap usage peaks every
> >>>>>>>>>>>>> time, consistently.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Markus
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>>> From:Shawn Heisey <apache@elyograg.org>
> >>>>>>>>>>>>>> Sent: Saturday 27th January 2018 0:49
> >>>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>>> Subject: Re: 7.2.1 cluster dies within minutes after restart
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> >>>>>>>>>>>>>>> o.a.z.ClientCnxn Client session timed out, have not heard
> >>>>>>>>>>>>>>> from server in 22130ms (although zkClientTimeOut is 30000).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Are you absolutely certain that there is a setting for
> >>>>>>>>>>>>>> zkClientTimeout that is actually getting applied?  The
> >>>>>>>>>>>>>> default value in Solr's example configs is 30 seconds, but
> >>>>>>>>>>>>>> the internal default in the code (when no configuration is
> >>>>>>>>>>>>>> found) is still 15.  I have confirmed this in the code.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Looks like SolrCloud doesn't log the values it's using for
> >>>>>>>>>>>>>> things like zkClientTimeout.  I think it should.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-11915
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Shawn
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>> --
> >>>>> Ere Maijala
> >>>>> Kansalliskirjasto / The National Library of Finland
> >>>>>
> >>>>
> >>>
> >
>
>
