lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: trunk is unable to replicate between nodes ( Unable to download ... completely)
Date Thu, 01 Nov 2012 09:18:31 GMT
Hi,

What we're seeing is two replica's that are, for whatever reason, not completely in sync and
unable to recover. While one node is in the `recovering state` we can fool it to be in active
state by simply reloading the core on the machine on which it is in a bad state. Some arbitrary
time later it will return to recovering.

Below is the communication between and the exceptions on the nodes in order:


Recovering machine starts the recovery process:

2012-11-01 08:55:13,533 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting
Replication Recovery. core=shard_i
2012-11-01 08:55:13,534 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new
http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:13,732 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster
state change has occurred - updating... (10)
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr
path=/admin/ping params={} hits=1 status=0 QTime=10 
2012-11-01 08:55:14,525 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr
path=/admin/ping params={} status=0 QTime=10 
2012-11-01 08:55:14,700 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr
path=/admin/ping params={} hits=1 status=0 QTime=5 
2012-11-01 08:55:14,701 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr
path=/admin/ping params={} status=0 QTime=6 
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering
updates. core=shard_i
2012-11-01 08:55:15,546 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer
updates. FSUpdateLog{state=ACTIVE, tlog=null}
2012-11-01 08:55:15,546 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting
to replicate from http://rot09.nl.idx.openindex.io:8080/solr/shard_i/. core=shard_i
2012-11-01 08:55:15,546 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new
http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-01 08:55:15,557 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  No value set
for 'pollInterval'. Timer Task not started.
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation:
166
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation:
3
2012-11-01 08:55:15,559 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication
process
2012-11-01 08:55:15,568 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files
in latest index in master: 324
2012-11-01 08:55:15,569 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No
lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
lockFactory=org.apache.lucene.store.NativeFSLockFactory@6f8288d2; maxCacheMB=48.0 maxMergeSizeMB=4.0)
assuming 'simple'
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return
new directory for /opt/solr/cores/shard_i/data/index.20121101085515568 forceNew:false
2012-11-01 08:55:15,569 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_i/data
2012-11-01 08:55:15,572 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download
to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0)
fullCopy=true


Right now the leader is transmitting packets as we expect:


2012-11-01 08:55:15,567 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr
path=/replication params={command=filelist&generation=166&wt=javabin&q
t=/replication&version=2} status=0 QTime=4 
2012-11-01 08:55:15,576 INFO [solr.core.SolrCore] - [http-8080-exec-3] - : [shard_i] webapp=/solr
path=/replication params={file=_2b9_nrm.cfs&command=filecontent&checks
um=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,583 INFO [solr.core.SolrCore] - [http-8080-exec-5] - : [shard_i] webapp=/solr
path=/replication params={file=_2t8_Lucene41_0.doc&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0

2012-11-01 08:55:15,589 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_i] webapp=/solr
path=/replication params={file=_2sc.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,598 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_i] webapp=/solr
path=/replication params={file=_2qk.tvd&command=filecontent&checksum=t
rue&generation=166&qt=/replication&wt=filestream} status=0 QTime=0 
2012-11-01 08:55:15,603 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_i] webapp=/solr
path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&generation=166&qt=/replication&wt=filestream} status=0 QTime=0



Until the leader cannot read the index and therefore not complete the upload:


2012-11-01 08:55:15,685 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Exception
while writing response for params: file=_2ii_Lucene41_0.pos&command=filecontent&checksum=true&generation=166&qt=/replication&wt=filestream
java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/shard_i/data/index.20121030152231942/_2ii_Lucene41_0.pos")
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,690 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_i] webapp=/solr
path=/replication params={file=_2ii_Lucene41_0.pos&command=filecontent
&checksum=true&offset=1048576&generation=166&qt=/replication&wt=filestream}
status=0 QTime=0 
2012-11-01 08:55:15,691 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-11-01 08:55:15,692 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:530)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
        at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
        at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
        at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
        at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)


The recovering node isn't very happy and drops the replication and waits...


2012-11-01 08:55:15,686 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching
packets 
java.io.EOFException
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
        at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
2012-11-01 08:55:15,692 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,692 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary
index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_i/data/index.20121101085515568
lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1b195462; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-01 08:55:15,693 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_i/data/index.20121101085515568
2012-11-01 08:55:15,695 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing
directory:/opt/solr/cores/shard_i/data/index
2012-11-01 08:55:15,695 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull
failed :org.apache.solr.common.SolrException: Unable to download _2ii_Lucene41_0.pos completely.
Downloaded 1048576!=1501300
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
        at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
        at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,695 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while
trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)

2012-11-01 08:55:15,696 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered
updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-01 08:55:15,696 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery
failed - trying again... (8) core=shard_i
2012-11-01 08:55:15,696 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 512.0
seconds before trying to recover again (9)


It seems to me the SolrCloud part is actually running and behaving as it should but the index
file on the leader is corrupt, and that should not happen even on power loss. Any hints?


Thanks
Markus
 
 
-----Original message-----
> From:Markus Jelsma <markus.jelsma@openindex.io>
> Sent: Wed 31-Oct-2012 14:14
> To: solr-user@lucene.apache.org; Markus Jelsma <markus.jelsma@openindex.io>
> Subject: RE: trunk is unable to replicate between nodes ( Unable to download ... completely)
> 
> Ah, we're also seeing Solr lookup an unexisting directory:
> 
> 2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - [http-8080-exec-2] -
: IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException:
directory '/opt/solr/cores/shard_a/data/index' does not exist
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220)
>         at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
>         at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
>         at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
> 
> Instead of data/index it should be looking for data/index.20121030152324761/, which actually
does exist.
> 
>  
>  
> -----Original message-----
> > From:Markus Jelsma <markus.jelsma@openindex.io>
> > Sent: Tue 30-Oct-2012 17:30
> > To: solr-user@lucene.apache.org
> > Subject: trunk is unable to replicate between nodes ( Unable to download ... completely)
> > 
> > Hi,
> > 
> > We're testing again with today's trunk and using the new Lucene 4.1 format by default.
When nodes are not restarted things are kind of stable but restarting nodes leads to a lot
of mayhem. It seems we can get the cluster back up and running by clearing ZK and restarting
everything (another issue) but replication becomes impossible for some nodes leading to a
continuous state of failing recovery etc.
> > 
> > Here are some excepts from the logs:
> > 
> > 2012-10-30 16:12:39,674 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exe
> > c-5] - : null:java.lang.IndexOutOfBoundsException
> >         at java.nio.Buffer.checkBounds(Buffer.java:530)
> >         at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
> >         at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferInde
> > xInput.java:91)
> >         at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(
> > ReplicationHandler.java:1065)
> >         at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > 2012-10-30 16:10:32,220 ERROR [solr.handler.ReplicationHandler] - [RecoveryThrea
> > d] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download
> >  _x.fdt completely. Downloaded 13631488!=13843504
> >         at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapP
> > uller.java:1237)
> >         at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(Sna
> > pPuller.java:1118)
> >         at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java
> > :716)
> >         at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
> >         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
> >         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
> >         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
> > 
> > 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - [http-8080-exec
> > -3] - : Exception while writing response for params: file=_p_Lucene41_0.doc&comm
> > and=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream
> > java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc")
> >         at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
> >         at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
> >         at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> > 
> > 
> > Needless to say i'm puzzled so i'm wondering if anyone has seen this before or have
some hints that might help digg further.
> > 
> > Thanks,
> > Markus
> > 
> 

Mime
View raw message