lucene-solr-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Date Wed, 03 Apr 2013 13:22:29 GMT
Ok, so clearing the transaction log got things going again.  I am going
to clear the index and try to reproduce the problem on 4.2.0, and then I'll
try on 4.2.1.
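For anyone landing here later: "clearing the transaction log" means removing the tlog directory under the core's data directory while the node is stopped, after which the node recovers via replication on restart. A rough sketch of that cleanup step follows; the `ClearTlog` helper and the data-directory layout are assumptions for illustration, not part of Solr, and note that any uncommitted updates still in the tlog are lost.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper: delete the "tlog" directory under a core's data
// directory.  Stop Solr first; on restart the node has no transaction
// log to replay and can sync from the leader instead.
public class ClearTlog {
    // Returns true if a tlog directory existed and was removed.
    public static boolean clearTlog(Path coreDataDir) throws IOException {
        Path tlog = coreDataDir.resolve("tlog");
        if (!Files.isDirectory(tlog)) {
            return false; // nothing to clear
        }
        // Walk deepest-first so files are deleted before their directory.
        try (Stream<Path> paths = Files.walk(tlog)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return true;
    }
}
```

In practice this is usually just `rm -rf` on the tlog directory with the node down; the sketch only spells out the same operation.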


On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmiller@gmail.com> wrote:

> No, not that I know of, which is why I say we need to get to the bottom of
> it.
>
> - Mark
>
> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>
> > Mark
> > Is there a particular JIRA issue that you think may address this? I read
> > through it quickly but didn't see one that jumped out.
> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2003@gmail.com> wrote:
> >
> >> I brought the bad one down and back up and it did nothing.  I can clear
> >> the index and try 4.2.1.  I will save off the logs and see if there is
> >> anything else odd.
> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmiller@gmail.com> wrote:
> >>
> >>> It would appear it's a bug given what you have said.
> >>>
> >>> Any other exceptions would be useful. Might be best to start tracking
> in
> >>> a JIRA issue as well.
> >>>
> >>> To fix, I'd bring the node that's behind down and back up again.
> >>>
> >>> Unfortunately, I'm pressed for time, but we really need to get to the
> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
> >>> (spreading to mirrors now).
> >>>
> >>> - Mark
> >>>
> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>
> >>>> Sorry I didn't ask the obvious question.  Is there anything else that
> >>>> I should be looking for here, and is this a bug?  I'd be happy to troll
> >>>> through the logs further if more information is needed, just let me
> >>>> know.
> >>>>
> >>>> Also, what is the most appropriate mechanism to fix this?  Is it
> >>>> required to kill the index that is out of sync and let solr resync
> >>>> things?
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>
> >>>>> sorry for spamming here....
> >>>>>
> >>>>> shard5-core2 is the instance we're having issues with...
> >>>>>
> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>> SEVERE: shard update error StdNode:
> >>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non
> >>>>> ok status:503, message:Service Unavailable
> >>>>>        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>        at java.lang.Thread.run(Thread.java:662)
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>>
> >>>>>> here is another one that looks interesting
> >>>>>>
> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
> >>>>>> the leader, but locally we don't think so
> >>>>>>        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>>>
> >>>>>>> Looking at the master it looks like at some point there were shards
> >>>>>>> that went down.  I am seeing things like what is below.
> >>>>>>>
> >>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
> >>>>>>> (live nodes size: 12)
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>> INFO: Updating live nodes... (9)
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>> INFO: Running the leader process.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I don't think the versions you are thinking of apply here.  PeerSync
> >>>>>>>> does not look at that - it looks at version numbers for updates in
> >>>>>>>> the transaction log - it compares the last 100 of them on leader and
> >>>>>>>> replica.  What it's saying is that the replica seems to have versions
> >>>>>>>> that the leader does not.  Have you scanned the logs for any
> >>>>>>>> interesting exceptions?
> >>>>>>>>
> >>>>>>>> Did the leader change during the heavy indexing?  Did any zk session
> >>>>>>>> timeouts occur?
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and
> >>>>>>>>> noticed a strange issue while testing today.  Specifically, the
> >>>>>>>>> replica has a higher version than the master, which is causing the
> >>>>>>>>> index to not replicate.  Because of this the replica has fewer
> >>>>>>>>> documents than the master.  What could cause this, and how can I
> >>>>>>>>> resolve it short of taking down the index and scp'ing the right
> >>>>>>>>> version in?
> >>>>>>>>>
> >>>>>>>>> MASTER:
> >>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>> Num Docs: 164880
> >>>>>>>>> Max Doc: 164880
> >>>>>>>>> Deleted Docs: 0
> >>>>>>>>> Version: 2387
> >>>>>>>>> Segment Count: 23
> >>>>>>>>>
> >>>>>>>>> REPLICA:
> >>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>> Num Docs: 164773
> >>>>>>>>> Max Doc: 164773
> >>>>>>>>> Deleted Docs: 0
> >>>>>>>>> Version: 3001
> >>>>>>>>> Segment Count: 30
> >>>>>>>>>
> >>>>>>>>> in the replica's log it says this:
> >>>>>>>>>
> >>>>>>>>> INFO: Creating new http client,
> >>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
> >>>>>>>>> otherHigh=1431233789440294912
> >>>>>>>>>
> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>>>>> DONE. sync succeeded
> >>>>>>>>>
> >>>>>>>>> which again seems to point to it thinking it has a newer version of
> >>>>>>>>> the index, so it aborts.  This happened while having 10 threads
> >>>>>>>>> indexing 10,000 items writing to a 6 shard (1 replica each) cluster.
> >>>>>>>>> Any thoughts on this or what I should look for would be appreciated.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
>
>
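Mark's point above is that PeerSync compares per-update version numbers from the transaction log (the last 100 on leader and replica), not the index "Version" shown on the core admin page. A rough sketch of that idea follows; the `PeerSyncSketch` class and the exact comparison rule are simplified assumptions for illustration, not Solr's actual implementation:

```java
import java.util.Collections;
import java.util.List;

// Simplified, hypothetical sketch of a PeerSync-style check: each node
// tracks the version numbers of its most recent updates, and the
// replica skips fetching the peer's index when its own versions look
// newer than the ones the peer reports.
public class PeerSyncSketch {

    // Returns true when our newest tracked update version is at or
    // beyond the peer's newest - roughly the situation logged as
    // "Our versions are newer. ourLowThreshold=... otherHigh=...".
    static boolean replicaLooksAhead(List<Long> ourVersions, List<Long> peerVersions) {
        long ourHigh = Collections.max(ourVersions);
        long otherHigh = Collections.max(peerVersions);
        return ourHigh >= otherHigh;
    }

    public static void main(String[] args) {
        // The replica holds update versions the leader never saw, so it
        // concludes it is ahead and reports "sync succeeded" without
        // pulling the leader's index - matching the log above.
        List<Long> replica = List.of(105L, 106L, 107L);
        List<Long> leader  = List.of(101L, 102L, 103L);
        System.out.println(replicaLooksAhead(replica, leader)); // prints "true"
    }
}
```

This also explains why the admin-UI numbers in the thread are misleading: the replica can show a higher index Version (3001 vs 2387) while still missing documents, because that number is unrelated to the tlog update versions PeerSync actually compares.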
