lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: 6.6 cloud starting to eat CPU after 8+ hours
Date Mon, 24 Jul 2017 13:35:57 GMT
Alright, after adding a field and a full cluster restart, the cluster is going nuts once again,
and this time almost immediately after the restart.

I have now restarted all but one node so there is some room to compare, or so I thought. Now
the node I didn't restart has also dropped its CPU usage. This seems to correspond to another
incident some time ago where all nodes went crazy over an extended period, but calmed down
after a few were restarted. So it could be a problem of inter-node communication.

The index is one segment at this moment, but some documents are being indexed. Some queries
are executed, but not very many. Attaching the stack anyway.
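
To compare a stack dump from a busy node with one from a calm node, the top frame of each thread could be tallied. Below is a minimal Python sketch, assuming a jstack-style text dump. The function name top_frames and the sample dump are made up for illustration (only the two method names come from the sampler output discussed in this thread), and real dumps may need a looser pattern.

```python
import re
from collections import Counter

# Matches a stack frame line such as "\tat pkg.Class.method(File.java)".
FRAME_RE = re.compile(r"^\s+at\s+([\w.$<>]+)\(")


def top_frames(dump_text):
    """Count the top-most frame of every thread in a jstack-style dump."""
    counts = Counter()
    expecting_top = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A quoted thread name opens a new thread section.
            expecting_top = True
            continue
        match = FRAME_RE.match(line)
        if match and expecting_top:
            # Only the first frame after the thread header is counted.
            counts[match.group(1)] += 1
            expecting_top = False
    return counts


# Hypothetical sample, NOT taken from the real cluster.
SAMPLE_DUMP = '''\
"qtp-1" #21 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
\tat org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java)
\tat example.Caller.run(Caller.java)

"qtp-2" #22 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
\tat org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java)

"qtp-3" #23 prio=5 runnable
   java.lang.Thread.State: RUNNABLE
\tat org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java)
'''

if __name__ == "__main__":
    for frame, n in top_frames(SAMPLE_DUMP).most_common():
        print(n, frame)
```

Running the same tally on a dump from a restarted node and on one from a busy node would show whether the codec frames dominate only on the busy one.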

-----Original message-----
> From:Mikhail Khludnev <mkhl@apache.org>
> Sent: Wednesday 19th July 2017 14:41
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> 
> You can get a stack dump with kill -3 or jstack, or even from the Solr admin UI. Overall, this
> behavior looks like a typical heavy merge kicking off from time to time.
> 
> On Wed, Jul 19, 2017 at 3:31 PM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
> 
> > Hello,
> >
> > No, I cannot expose the stack; the VisualVM samples won't show it to me.
> >
> > I am not sure if they're about to sync all the time, but every 15 minutes
> > some documents are indexed (3 - 4k). For some reason, index time does
> > increase with latency / CPU usage.
> >
> > This situation runs fine for many hours, then it will slowly start to go
> > bad, until nodes are restarted (or index size decreased).
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From:Mikhail Khludnev <mkhl@apache.org>
> > > Sent: Wednesday 19th July 2017 14:18
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> > >
> > > >
> > > > > The real distinction between busy and calm nodes is that busy nodes all
> > > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as
> > > > > second to fillBuffer(), what are they doing?
> > >
> > >
> > > Can you expose the stack deeper?
> > > Can they start to sync shards due to some reason?
> > >
> > > On Wed, Jul 19, 2017 at 12:35 PM, Markus Jelsma <markus.jelsma@openindex.io>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > Another peculiarity here, our six node (2 shards / 3 replicas) cluster is
> > > > going crazy after a good part of the day has passed. It starts eating CPU
> > > > for no good reason and its latency goes up. Grafana graphs show the problem
> > > > really well.
> > > >
> > > > After restarting 2/6 nodes, there is also quite a distinction in the
> > > > VisualVM monitor views, and the VisualVM CPU sampler reports (sorted on
> > > > self time (CPU)). The busy nodes are deeply red in
> > > > o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual), the
> > > > restarted nodes are not.
> > > >
> > > > The real distinction between busy and calm nodes is that busy nodes all
> > > > have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as
> > > > second to fillBuffer(), what are they doing?! Why? The calm nodes don't
> > > > show this at all. Busy nodes all have o.a.l.codec stuff on top, restarted
> > > > nodes don't.
> > > >
> > > > So, actually, i don't have a clue! Any, any ideas?
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > > Each replica is underpowered but performing really well after restart (and
> > > > JVM warmup), 4 CPUs, 900M heap, 8 GB RAM, maxDoc 2.8 million, index size
> > > > 18 GB.
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 