lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SOLR4 cluster - strange CPU spike on slave
Date Fri, 07 Dec 2012 00:01:14 GMT
Not quite. Allocating too much memory to the JVM starves the operating system
and the file-system caching that goes on there. Objects that consume memory are
created all the time in Solr, and they won't be reclaimed until some GC
threshold is passed. So you can be sure that the more memory you allocate to
the JVM, the more will be used. And the GC pauses will eventually get quite
long.

So over-allocating memory to the JVM is discouraged.
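As a rough illustration of that advice (the numbers below are hypothetical, not from this thread): give the JVM only the heap it actually needs and leave the rest of RAM to the OS page cache, which Lucene leans on heavily for index reads. A minimal sketch:

```python
# Illustrative sketch only: a crude starting split between JVM heap and the
# OS page cache. The 25% fraction is a hypothetical default, not a Solr
# recommendation; the real answer is to measure heap usage under load.
def suggest_heap_gb(total_ram_gb: float, heap_fraction: float = 0.25) -> float:
    """Suggest a max heap size (GB), leaving the rest to the OS page cache."""
    return total_ram_gb * heap_fraction

# e.g. on a 64 GB machine: ~16 GB heap, ~48 GB left for file caching.
print(suggest_heap_gb(64))  # 16.0
```

The point is not the specific fraction but the direction: start small and grow the heap only when monitoring shows it is actually needed.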

Best
Erick


On Wed, Dec 5, 2012 at 11:28 PM, John Nielsen <jn@mcb.dk> wrote:

> I'm not sure I understand why this is important. Too much memory would just
> be unused.
>
> This is what the heap looks like now:
>
> Heap Configuration:
>    MinHeapFreeRatio = 40
>    MaxHeapFreeRatio = 70
>    MaxHeapSize      = 17179869184 (16384.0MB)
>    NewSize          = 21757952 (20.75MB)
>    MaxNewSize       = 283508736 (270.375MB)
>    OldSize          = 65404928 (62.375MB)
>    NewRatio         = 7
>    SurvivorRatio    = 8
>    PermSize         = 21757952 (20.75MB)
>    MaxPermSize      = 176160768 (168.0MB)
>
> Heap Usage:
> New Generation (Eden + 1 Survivor Space):
>    capacity = 255197184 (243.375MB)
>    used     = 108828496 (103.78694152832031MB)
>    free     = 146368688 (139.5880584716797MB)
>    42.644865548359654% used
> Eden Space:
>    capacity = 226885632 (216.375MB)
>    used     = 83498424 (79.63030242919922MB)
>    free     = 143387208 (136.74469757080078MB)
>    36.80198841326365% used
> From Space:
>    capacity = 28311552 (27.0MB)
>    used     = 25330072 (24.156639099121094MB)
>    free     = 2981480 (2.8433609008789062MB)
>    89.46903370044849% used
> To Space:
>    capacity = 28311552 (27.0MB)
>    used     = 0 (0.0MB)
>    free     = 28311552 (27.0MB)
>    0.0% used
> concurrent mark-sweep generation:
>    capacity = 16896360448 (16113.625MB)
>    used     = 12452710200 (11875.829887390137MB)
>    free     = 4443650248 (4237.795112609863MB)
>    73.70054775005708% used
> Perm Generation:
>    capacity = 70578176 (67.30859375MB)
>    used     = 37652032 (35.90777587890625MB)
>    free     = 32926144 (31.40081787109375MB)
>    53.347981109627995% used
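The percentages jmap prints are simply used / capacity; a quick check of the CMS old-generation figure from the dump above (names below are my own, for illustration):

```python
# jmap's usage percentages are used / capacity. Verifying the CMS
# old-generation line above: ~11.9 GB used of ~16.1 GB capacity.
def pct_used(used_bytes: int, capacity_bytes: int) -> float:
    """Percentage of a heap region currently in use."""
    return 100.0 * used_bytes / capacity_bytes

old_gen = pct_used(12452710200, 16896360448)
print(round(old_gen, 2))  # 73.7
```

With nearly 12 GB of a 16 GB heap already occupied by the old generation, the over-allocation concern in the reply above is easy to see.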
>
>
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>
>
> On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
> > If this is caused by index segment merging, you should be able to see that
> > very clearly on the Index report in SPM, where you would see sudden graph
> > movement at the time of the spike, corresponding to the CPU and disk
> > activity. I think uncommenting the infoStream entry in solrconfig.xml
> > would also show it.
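For reference, in the stock Solr 4 example solrconfig.xml the setting referred to here lives in the &lt;indexConfig&gt; section; uncommented and enabled it looks roughly like this (the file name is the example's default):

```xml
<indexConfig>
  <!-- Lucene IndexWriter debug output, including segment merge activity,
       is written to the named file when this is set to true. -->
  <infoStream file="INFOSTREAM.txt">true</infoStream>
</indexConfig>
```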
> >
> > Otis
> > --
> > SOLR Performance Monitoring - http://sematext.com/spm
> > On Nov 28, 2012 9:20 PM, "Erick Erickson" <erickerickson@gmail.com>
> wrote:
> >
> > > Am I reading this right? All you're doing on varnish1 is replicating to
> > > it? You're not searching or indexing? I'm sure I'm misreading this.....
> > >
> > >
> > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > racing." This _sounds_ suspiciously like segment merging, especially the
> > > "disks racing" bit. Or possibly replication. Neither of which makes much
> > > sense. But is there any chance that somehow multiple commits are being
> > > issued? Of course, if varnish1 is a slave, that shouldn't be happening
> > > either.
> > >
> > > And the whole bit about nothing going to the logs is just bizarre. I'm
> > > tempted to blame hardware gremlins, especially if you see nothing
> > > similar on varnish2. Or some other process is pegging the machine. All
> > > of which is a way of saying "I have no idea"....
> > >
> > > Yours in bewilderment,
> > > Erick
> > >
> > >
> > >
> > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn@mcb.dk> wrote:
> > >
> > > > I apologize for the late reply.
> > > >
> > > > The query load is more or less stable during the spikes. There are
> > > > always fluctuations, but nothing on the order of magnitude that could
> > > > explain this spike. In fact, the latest spike occurred last night when
> > > > almost no one was using it.
> > > >
> > > > To test a hunch of mine, I tried to deactivate all caches by
> > > > commenting out all cache entries in solrconfig.xml. It still spikes,
> > > > so I don't think it has anything to do with cache warming or
> > > > hits/misses or anything of the sort.
> > > >
> > > > One interesting thing about the GC though. This is our latest spike
> > > > with CPU load (this server has 8 cores, so a load higher than 8 is
> > > > potentially troublesome):
> > > >
> > > > 2012.Nov.27 19:58:18    2.27
> > > > 2012.Nov.27 19:57:17    4.06
> > > > 2012.Nov.27 19:56:18    8.95
> > > > 2012.Nov.27 19:55:17    19.97
> > > > 2012.Nov.27 19:54:17    32.27
> > > > 2012.Nov.27 19:53:18    1.67
> > > > 2012.Nov.27 19:52:17    1.6
> > > > 2012.Nov.27 19:51:18    1.77
> > > > 2012.Nov.27 19:50:17    1.89
> > > >
> > > > This is what the GC was doing around that time:
> > > >
> > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
> > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
> > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
> > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
> > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
> > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
> > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
> > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
> > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
> > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
> > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
> > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
> > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
> > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
> > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
> > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
> > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
> > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
> > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)] 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01, real=0.11 secs]
> > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)] 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00, real=0.13 secs]
> > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)] 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01, real=0.15 secs]
> > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)] 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01, real=0.15 secs]
> > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)] 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00, real=0.12 secs]
> > > >
> > > > So the GC was doing a full collection around that time, but that only
> > > > accounts for around 5 seconds, not the couple of minutes the spike
> > > > lasts.
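One way to quantify that from the log itself is to sum the stop-the-world pause times ("real=... secs") over the window; a sketch (the two sample lines are abbreviated from the log above):

```python
import re

def total_gc_pause(log: str) -> float:
    """Sum the 'real=' stop-the-world pause times (seconds) in a GC log."""
    return sum(float(s) for s in re.findall(r"real=(\d+\.\d+) secs", log))

# Abbreviated lines from the log above: the Full GC plus one minor GC.
sample = (
    "[Full GC ... [Times: user=5.23 sys=0.00, real=5.23 secs]\n"
    "[GC ... [Times: user=0.27 sys=0.00, real=0.04 secs]\n"
)
print(round(total_gc_pause(sample), 2))  # 5.27
```

Run over the full excerpt, the minor-GC pauses add only a couple of seconds on top of the 5.23 s Full GC, which supports the point that GC alone cannot explain a multi-minute spike.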
> > > >
> > > > I am really at a complete loss as to what this could be. Google has
> > > > not given me any clues.
> > > >
> > > >
> > > > Med venlig hilsen / Best regards
> > > >
> > > > *John Nielsen*
> > > > Programmer
> > > >
> > > >
> > > >
> > > > *MCB A/S*
> > > > Enghaven 15
> > > > DK-7500 Holstebro
> > > >
> > > > Kundeservice: +45 9610 2824
> > > > post@mcb.dk
> > > > www.mcb.dk
> > > >
> > > >
> > > >
> > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > > otis.gospodnetic@gmail.com> wrote:
> > > >
> > > > > Strange indeed. What about query load/QPS during that time? What
> > > > > about GC?
> > > > > And does cache hit rate drop?
> > > > >
> > > > > Otis
> > > > > --
> > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn@mcb.dk> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are seeing a strange CPU spike on one of our solr4 servers
> > > > > > which we are unable to explain. The spike, which only lasts for a
> > > > > > couple of minutes, sends the disks racing. This happens a few
> > > > > > times a day. This is what the load looks like:
> > > > > >
> > > > > > 2012.Nov.14 13:37:17    2.77
> > > > > > 2012.Nov.14 13:36:17    3.65
> > > > > > 2012.Nov.14 13:35:18    3.92
> > > > > > 2012.Nov.14 13:34:17    3.95
> > > > > > 2012.Nov.14 13:33:18    6.56
> > > > > > 2012.Nov.14 13:32:17    10.79
> > > > > > 2012.Nov.14 13:31:17    24.38
> > > > > > 2012.Nov.14 13:30:17    63.35
> > > > > > 2012.Nov.14 13:29:17    24.68
> > > > > > 2012.Nov.14 13:28:17    2.44
> > > > > > 2012.Nov.14 13:27:18    3.51
> > > > > > 2012.Nov.14 13:26:17    5.26
> > > > > > 2012.Nov.14 13:25:18    5.71
> > > > > > 2012.Nov.14 13:24:17    2.7
> > > > > >
> > > > > > The problem is that out of a 3-minute spike, I get about 40
> > > > > > seconds of silence in the logs. This log usually adds about a
> > > > > > thousand lines every second. Not being able to communicate with
> > > > > > the server for this long breaks our use case.
> > > > > >
> > > > > > We have two servers, varnish01 and varnish02. We used to feed data
> > > > > > to varnish02 and replicate it to varnish01, where the data is then
> > > > > > read from. When we discovered this issue, we moved all traffic to
> > > > > > varnish02, so that data is still being replicated to varnish01 but
> > > > > > otherwise it gets zero traffic. The spike did not disappear.
> > > > > >
> > > > > > The spike we are seeing is on varnish01 only.
> > > > > >
> > > > > > Please note that our use case requires us to continuously feed
> > > > > > large amounts of data from our main system, up to 1,000 records
> > > > > > every minute. Our solrconfig.xml is attached.
> > > > > >
> > > > > > Has anyone seen this phenomenon before?
> > > > > >
> > > > > > Med venlig hilsen / Best regards
> > > > > >
> > > > > > *John Nielsen*
> > > > > > Programmer
> > > > > >
> > > > > >
> > > > > >
> > > > > > *MCB A/S*
> > > > > > Enghaven 15
> > > > > > DK-7500 Holstebro
> > > > > >
> > > > > > Kundeservice: +45 9610 2824
> > > > > > post@mcb.dk
> > > > > > www.mcb.dk
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
