lucene-solr-user mailing list archives

From Alessandro Benedetti <a.benede...@sease.io>
Subject Re: Solr 6.3.0 - recovery failed
Date Wed, 01 Feb 2017 18:37:35 GMT
I can't debug the code now, but if you access the logs directly (not
from the UI), is there any "Caused by" associated with the recovery
failure exception?
Cheers
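Those root causes can be pulled straight out of the log file with a quick grep over the recovery errors. A sketch, where the log path in the usage line and the exact "Recovery failed" message text are assumptions about a default install:

```shell
# Sketch: surface "Caused by" lines near recovery failures in a Solr log.
# The "Recovery failed" message text is an assumption about the log format.
recovery_causes() {
  grep -A 5 'Recovery failed' "$1" | grep -i 'caused by'
}

# Usage (log path is an assumption):
#   recovery_causes /var/solr/logs/solr.log
```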

On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
wrote:

> In HDFS when a node fails it will leave behind write.lock files in HDFS.
> These files have to be manually removed; otherwise the shards/replicas that
> have write.lock files left behind will not start.  Since I can't tell which
> physical node is hosting which shard/replica, I stop all the nodes, delete
> all the write.lock files in HDFS and restart.
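The manual write.lock cleanup described above can be scripted against the HDFS listing. A sketch, where `/solr` as the HDFS data root is an assumption (adjust to your `solr.hdfs.home`), and which should only be run while the affected Solr nodes are stopped:

```shell
# Sketch: extract stale write.lock paths from `hdfs dfs -ls -R` output.
find_write_locks() {
  awk '/\/write\.lock$/ {print $NF}'
}

# Usage (/solr as the data root is an assumption; run with Solr stopped):
#   hdfs dfs -ls -R /solr | find_write_locks | xargs -r -n 1 hdfs dfs -rm
```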
>
> You are correct - only one replica is failing to start.  The other
> replicas on the same physical node are coming up OK.  A picture is worth a
> thousand words, so:
> http://lovehorsepower.com/images/Cluster1.jpg
>
> Errors:
> http://lovehorsepower.com/images/ClusterSolr2.jpg
>
> -Joe
>
> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>
>> Ok, it is clearer now.
>> You have 9 Solr nodes running, one per physical machine,
>> so each node hosts a number of cores (both replicas and leaders).
>> When the node died, you got a lot of corrupted indexes.
>> I still don't see why you restarted the other 8 working nodes (I was
>> expecting you to restart only the failed one).
>>
>> When you mention that only one replica is failing, do you mean that the
>> Solr node is up and running and only one Solr core (the replica of one
>> shard) keeps failing?
>> Or are all the local cores on that node failing to recover?
>>
>> Cheers
>>
>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
>> wrote:
>>
>> Thank you for the response.
>> There are no virtual machines in the configuration.  The collection has 45
>> shards with 3 replicas each, spread across the 9 physical boxes; each box
>> is running one copy of Solr.  I've tried to restart just the one node after
>> the other 8 (and all their shards/replicas) came up, but this one replica
>> seems to be stuck in perma-recovery.
>>
>> Shard Count: 45
>> replicationFactor: 3
>> maxShardsPerNode: 50
>> router: compositeId
>> autoAddReplicas: false
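Since it is hard to tell which physical node hosts which shard/replica, the Collections API CLUSTERSTATUS response maps every replica to its node, which would let only the affected node's locks be cleaned. A sketch that parses that JSON on stdin; the host and port in the usage line are assumptions:

```shell
# Sketch: print one "collection shard replica node state" line per replica
# from a CLUSTERSTATUS JSON response piped on stdin.
replica_hosts() {
  python3 -c '
import json, sys
status = json.load(sys.stdin)
for coll, cdata in status["cluster"]["collections"].items():
    for shard, sdata in cdata["shards"].items():
        for name, rep in sdata["replicas"].items():
            print(coll, shard, name, rep["node_name"], rep["state"])
'
}

# Usage (host and port are assumptions):
#   curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | replica_hosts
```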
>>
>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>
>> _TUNE is:
>> "-XX:+UseG1GC \
>> -XX:MaxDirectMemorySize=8g \
>> -XX:+PerfDisableSharedMem \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=32m \
>> -XX:MaxGCPauseMillis=500 \
>> -XX:InitiatingHeapOccupancyPercent=75 \
>> -XX:ParallelGCThreads=16 \
>> -XX:+UseLargePages \
>> -XX:-ResizePLAB \
>> -XX:+AggressiveOpts"
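For reference, settings like these normally live in solr.in.sh. Note that the heap above starts at 16g but may grow to 32g; G1 is usually configured with Xms equal to Xmx to avoid resize pauses. A sketch of the equivalent fragment, where the file layout and the equal-heap choice are illustrative assumptions, not a confirmed fix:

```shell
# solr.in.sh fragment (sketch; values are illustrative assumptions)
SOLR_JAVA_MEM="-Xms32g -Xmx32g"   # fixed-size heap avoids G1 resize pauses
GC_TUNE="-XX:+UseG1GC \
  -XX:MaxDirectMemorySize=8g \
  -XX:+PerfDisableSharedMem \
  -XX:MaxGCPauseMillis=500"
```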
>>
>> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
>> afraid to continue indexing data if this one node will never come back.
>> Thanks for the help!
>>
>> -Joe
>>
>>
>>
>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>
>>> Let me try to summarize.
>>> How many virtual machines on top of the 9 physical?
>>> How many Solr processes (replicas?)
>>>
>>> If you had 1 node compromised,
>>> I assume you have replicas as well, right?
>>>
>>> Can you explain your replica configuration a little better?
>>> Why did you have to stop all the nodes?
>>>
>>> I would expect stopping the failing Solr node, cleaning up its index,
>>> and restarting it; it would then recover from the leader automatically.
>>>
>>> Something is suspicious here, let us know!
>>>
>>> Cheers
>>>
>>>
>>>
>>> -----
>>> Alessandro Benedetti
>>> Search Consultant, R&D Software Engineer, Director
>>> Sease Ltd. - www.sease.io
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>
