lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: 6.6 cloud starting to eat CPU after 8+ hours
Date Wed, 26 Jul 2017 21:11:18 GMT
Hello Mikhail,

Spot on, there is indeed not enough heap when our nodes are in this crazy state. When the
nodes are happy, the average heap consumption is 50 to 60 percent, at peak when indexing there
is in general enough heap for the warming to run smoothly.

I probably forgot to mention that high CPU is in our case also high heap consumption when
the nodes act mad. The stack trace you saw was a crazy node while documents were indexed,
so the blocking you mention makes sense.

I still believe this is not a heap issue but something odd in Solr that eats in (INSERT SITUATION)
unreasonably amounts of heap. Remind, the bad node went back in a normal 50-60 percent heap
consumption and normal CPU time after _other_ nodes got restarted. All nodes were bad and
went normal after restart. Bad nodes that were not restarted then suddenly revived and also
went back to normal. This is very odd behavior.

Observing this, i am inclined to think that Solr's inter-node communication can get into a
weird state, eating heap, eating CPU, going bad. Using CPU or heap sampling it is very hard,
for me at least, to quickly spot something that is bad so i am clueless. What do you think?

How can a bad node become normal just by restarting another bad node? Puzzling..

Thanks,
Markus
 
-----Original message-----
> From:Mikhail Khludnev <mkhl@apache.org>
> Sent: Wednesday 26th July 2017 10:50
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> 
> I've looked into stacktrace.
> I see that one thread triggers commit via update command. And it's blocked
> on searcher warming. The really odd thing is total state = BLOCKED. Can you
> check that there is a spare heap space available during indexing peak? And
> also that there free RAM (after heap allocation)? Can it happen that
> warming query is unnecessary heavy? Also, explicit commits might cause
> issues, consider the best practice with auto-commit and openSearcher=false
> and soft commit when necessary.
> 
> 
> On Mon, Jul 24, 2017 at 4:35 PM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
> 
> > Alright, after adding a field and full cluster restart, the cluster is
> > going nuts once again and this time almost immediately after the restart.
> >
> > I have now restarted all but one so there is some room to compare, or so i
> > thought. Now, the node i didn't restart also drops CPU-usage. This seems to
> > correspond to another incident some time ago where all nodes went crazy
> > over an extended period, but calmed down after a few were restarted. So it
> > could be a problem of inter-node communication.
> >
> > The index is is one segment at this moment but some documents are being
> > indexed. Some queries are executed but not very much. Attaching the stack
> > anyway.
> >
> >
> >
> >
> >
> > -----Original message-----
> > > From:Mikhail Khludnev <mkhl@apache.org>
> > > Sent: Wednesday 19th July 2017 14:41
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > >
> > > You can get stack from kill -3 jstack even from solradmin. Overall, this
> > > behavior looks like typical heavy merge kicking off from time to time.
> > >
> > > On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <
> > markus.jelsma@openindex.io>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > No i cannot expose the stack, VisualVM samples won't show it to me.
> > > >
> > > > I am not sure if they're about to sync all the time, but every 15
> > minutes
> > > > some documents are indexed (3 - 4k). For some reason, index time does
> > > > increase with latency / CPU usage.
> > > >
> > > > This situation runs fine for many hours, then it will slowly start to
> > go
> > > > bad, until nodes are restarted (or index size decreased).
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From:Mikhail Khludnev <mkhl@apache.org>
> > > > > Sent: Wednesday 19th July 2017 14:18
> > > > > To: solr-user <solr-user@lucene.apache.org>
> > > > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > > > >
> > > > > >
> > > > > > The real distinction between busy and calm nodes is that busy
> > nodes all
> > > > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$
> > FieldsReader.terms()
> > > > as
> > > > > > second to fillBuffer(), what are they doing?
> > > > >
> > > > >
> > > > > Can you expose the stack deeper?
> > > > > Can they start to sync shards due to some reason?
> > > > >
> > > > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <
> > > > markus.jelsma@openindex.io>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Another peculiarity here, our six node (2 shards / 3 replica's)
> > > > cluster is
> > > > > > going crazy after a good part of the day has passed. It starts
> > eating
> > > > CPU
> > > > > > for no good reason and its latency goes up. Grafana graphs show
the
> > > > problem
> > > > > > really well
> > > > > >
> > > > > > After restarting 2/6 nodes, there is also quite a distinction
in
> > the
> > > > > > VisualVM monitor views, and the VisualVM CPU sampler reports
> > (sorted on
> > > > > > self time (CPU)). The busy nodes are deeply red in o.a.h.impl.io.
> > > > > > AbstractSessionInputBuffer.fillBuffer (as usual), the restarted
> > nodes
> > > > are
> > > > > > not.
> > > > > >
> > > > > > The real distinction between busy and calm nodes is that busy
> > nodes all
> > > > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$
> > FieldsReader.terms()
> > > > as
> > > > > > second to fillBuffer(), what are they doing?! Why? The calm
nodes
> > don't
> > > > > > show this at all. Busy nodes all have o.a.l.codec stuff on top,
> > > > restarted
> > > > > > nodes don't.
> > > > > >
> > > > > > So, actually, i don't have a clue! Any, any ideas?
> > > > > >
> > > > > > Thanks,
> > > > > > Markus
> > > > > >
> > > > > > Each replica is underpowered but performing really well after
> > restart
> > > > (and
> > > > > > JVM warmup), 4 CPU's, 900M heap, 8 GB RAM, maxDoc 2.8 million,
> > index
> > > > size
> > > > > > 18 GB.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 

Mime
View raw message