lucene-solr-user mailing list archives

From Neal Ensor <nen...@gmail.com>
Subject Re: "Classic" 4.2 master-slave replication not completing
Date Mon, 01 Jul 2013 19:07:47 GMT
Is it conceivable that there's too much traffic, causing Solr to stall when
re-opening the searcher (and thus releasing the old index in favor of the new
one)?  I'm grasping at straws, and this is beginning to bug me a lot.  The
traffic logs don't seem to support this (apart from periodic health-check
pings, the load is distributed fairly evenly across the 3 slaves by a load
balancer).  After 35+ minutes this morning, none of the three had come
"unstuck", and all had to be manually core-reloaded.
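
For illustration only: the thread dump quoted further down shows the
snappuller thread parked in FutureTask.get(), called from
SnapPuller.openNewWriterAndSearcher, with no timeout.  The sketch below is
not Solr code; it only shows, under my own assumptions, how an unbounded
get() on a task that never completes parks forever, and how a bounded wait
would at least notice the stall.  The class name, the helper finishedWithin,
and the latch standing in for "new searcher is ready" are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWaitSketch {

    /**
     * Waits for the future with an upper bound.
     * Returns true if the task finished (normally or with an exception)
     * within the timeout, false if the wait timed out or was interrupted.
     */
    static boolean finishedWithin(Future<?> future, long timeout, TimeUnit unit) {
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // still running; the caller can log or escalate
        } catch (ExecutionException e) {
            return true;  // the task ended, although it failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        final CountDownLatch searcherReady = new CountDownLatch(1);

        // Hypothetical stand-in for "open a new searcher": it blocks until
        // something counts the latch down, which never happens here.
        Future<Void> stuck = pool.submit(new Callable<Void>() {
            public Void call() throws InterruptedException {
                searcherReady.await();
                return null;
            }
        });

        // A plain stuck.get() at this point would park indefinitely.
        boolean done = finishedWithin(stuck, 200, TimeUnit.MILLISECONDS);
        System.out.println("finished in time: " + done);

        pool.shutdownNow(); // interrupts the blocked task so the JVM can exit
    }
}
```

This does not explain why the searcher never becomes ready on the three
slaves; it only mirrors the shape of the wait seen in the dump.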

Is there perhaps a configuration element I'm overlooking that might make
Solr a bit less "friendly" about it, so that it just dumps the searchers and
reopens when replication completes?
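
Until an answer turns up, the manual reload could at least be scripted.
Here is a rough sketch of the two URLs involved, assuming the replication
handler is registered at the conventional "/replication" path and that
CoreAdmin is reachable at "/admin/cores".  The host, port, and core name are
made up, and the class and method names are hypothetical.  Fetching the URLs
is left to whatever HTTP client is at hand.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrAdminUrls {

    /** URL-encodes a single query-parameter value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Status request for one core's replication handler.
     * Assumes the handler is mapped to "/replication" in solrconfig.xml
     * and that the core name is safe to use as a path segment.
     */
    static String replicationDetailsUrl(String solrBase, String core) {
        return solrBase + "/" + core + "/replication?command=details&wt=json";
    }

    /** CoreAdmin request that reloads one core. */
    static String coreReloadUrl(String solrBase, String core) {
        return solrBase + "/admin/cores?action=RELOAD&core=" + enc(core);
    }

    public static void main(String[] args) {
        String base = "http://slave1.example.com:8080/solr"; // hypothetical host
        String core = "collection1";                         // hypothetical core

        System.out.println(replicationDetailsUrl(base, core));
        System.out.println(coreReloadUrl(base, core));
    }
}
```

A cron job could poll the details URL on each slave and issue the reload only
when a fetch has clearly finished but the searcher has not switched over.
That is a workaround, not a fix.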

As a side note, I'm getting really frustrated trying to get log4j logging
set up on 4.3.1; my Tomcat container persists in complaining that it cannot
find log4j.properties, even though I've put it in the WEB-INF/classes of
the war file, have the SLF4J libraries AND log4j at the shared container
"lib" level, and have log4j.debug turned on.  I can't find any reason why it
cannot locate the configuration.
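
One check that might narrow this down is asking the class loader directly
where (or whether) it can see the file.  The sketch below is a guess at a
useful diagnostic, not something taken from the Solr or log4j docs; the
class and method names are hypothetical.  Run from inside the webapp (for
example from a throwaway servlet or JSP), it would show whether the webapp's
class loader can see log4j.properties.  That may differ from what a log4j
loaded from the shared "lib" directory sees.

```java
import java.net.URL;

class Log4jConfigLocator {

    /**
     * Returns the URL the named resource resolves to, or null if it is not
     * visible. Tries the thread context class loader first, then this
     * class's own loader, then the system class loader.
     */
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = Log4jConfigLocator.class.getClassLoader();
        }
        if (loader == null) {
            return ClassLoader.getSystemResource(resourceName);
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        URL found = locate("log4j.properties");
        if (found == null) {
            System.out.println("log4j.properties is NOT visible to this class loader");
        } else {
            System.out.println("log4j.properties found at " + found);
        }
    }
}
```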

Any suggestions or pointers would be greatly appreciated.  Thanks!


On Thu, Jun 27, 2013 at 10:35 AM, Mark Miller <markrmiller@gmail.com> wrote:

> Odd - looks like it's stuck waiting to be notified that a new searcher is
> ready.
>
> - Mark
>
> On Jun 27, 2013, at 8:58 AM, Neal Ensor <nensor@gmail.com> wrote:
>
> > Okay, I have done this (updated to 4.3.1 across master and four slaves; one
> > of these is my own PC for experiments, it is not being accessed by clients).
> >
> > Just had a minor replication this morning, and all three slaves are "stuck"
> > again.  Replication supposedly started at 8:40, ended 30 seconds later or
> > so (on my local PC, set up identically to the other three slaves).  The
> > three slaves will NOT complete the roll-over to the new index.  All three
> > index folders have a write.lock and latest files are dated 8:40am (now it
> > is 8:54am, with no further activity in the index folders).  There exists an
> > "index.20130627084000061" (or some variation thereof) in all three slaves'
> > data folder.
> >
> > The seemingly-relevant thread dump of a "snappuller" thread on each of
> > these slaves:
> >
> >   - sun.misc.Unsafe.park(Native Method)
> >   - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> >   - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> >   - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> >   - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> >   - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
> >   - java.util.concurrent.FutureTask.get(FutureTask.java:83)
> >   - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
> >   - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
> >   - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
> >   - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
> >   - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >   - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >   - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >   - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> >   - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> >   - java.lang.Thread.run(Thread.java:662)
> >
> >
> > Here they sit.  My local PC "slave" replicated very quickly, switched over
> > to the new generation (206) immediately.  I am not sure why the three
> > slaves are dragging on this.  If there's any configuration elements or
> > other details you need, please let me know.  I can manually "kick" them by
> > reloading the core from the admin pages, but obviously I would like this to
> > be a hands-off process.  Any help is greatly appreciated; this has been
> > bugging me for some time now.
> >
> >
> >
> > On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <
> > shalinmangar@gmail.com> wrote:
> >
> >> A bunch of replication related issues were fixed in 4.2.1 so you're
> >> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
> >>
> >> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nensor@gmail.com> wrote:
> >>> As a bit of background, we run a setup (coming from 3.6.1 to 4.2 relatively
> >>> recently) with a single master receiving updates with three slaves pulling
> >>> changes in.  Our index is around 5 million documents, around 26GB in size
> >>> total.
> >>>
> >>> The situation I'm seeing is this:  occasionally we update the master, and
> >>> replication begins on the three slaves, seems to proceed normally until it
> >>> hits the end.  At that point, it "sticks"; there's no messages going on in
> >>> the logs, nothing on the admin page seems to be happening.  I sit there for
> >>> sometimes upwards of 30 minutes, seeing no further activity in the index
> >>> folder(s).   After a while, I go to the core admin page and manually reload
> >>> the core, which "catches it up".  It seems like the index readers / writers
> >>> are not releasing the index otherwise?  The configuration is set to reopen;
> >>> very occasionally this situation actually fixes itself after a longish
> >>> period of time, but it seems very annoying.
> >>>
> >>> I had at first suspected this to be due to our underlying shared (SAN)
> >>> storage, so we installed SSDs in all three slave machines, and moved the
> >>> entire indexes to those.  It did not seem to affect this issue at all
> >>> (additionally, I didn't really see the expected performance boost, but
> >>> that's a separate issue entirely).
> >>>
> >>> Any ideas?  Any configuration details I might share/reconfigure?  Any
> >>> suggestions are appreciated. I could also upgrade to the later 4.3+
> >>> versions, if that might help.
> >>>
> >>> Thanks!
> >>>
> >>> Neal Ensor
> >>> nensor@gmail.com
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
>
>
