lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joe Obernberger <>
Subject Re: Recovery Issue - Solr 6.6.1 and HDFS
Date Wed, 22 Nov 2017 13:44:08 GMT
Hi Kevin -
* HDFS is part of Cloudera 5.12.0.
* Solr is co-located in most cases.  We do have several nodes that run 
on servers that are not data nodes, but most do. Unfortunately, our 
nodes are not the same size.  Some nodes have 8TBytes of disk, while our 
largest nodes are 64TBytes.  This results in a lot of data that needs to 
go over the network.

* Command is:
/usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m 
-XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem 
-XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m 
-XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75 
-XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB 
-XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails 
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime 
-Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation 
-XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=300000,,,,

-Dsolr.log.dir=/opt/solr6/server/logs -Djetty.port=9100 -DSTOP.PORT=8100 
-DSTOP.KEY=solrrocks -Dhost=tarvos -Duser.timezone=UTC 
-Djetty.home=/opt/solr6/server -Dsolr.solr.home=/opt/solr6/server/solr 
-Dsolr.install.dir=/opt/solr6 -Dsolr.clustering.enabled=true 
-Dsolr.lock.type=hdfs -Dsolr.autoSoftCommit.maxTime=120000 
-Dsolr.autoCommit.maxTime=1800000 -Dsolr.solr.home=/etc/solr6 
-Xss256k -Dsolr.log.muteconsole 
-XX:OnOutOfMemoryError=/opt/solr6/bin/ 9100 
/opt/solr6/server/logs -jar start.jar --module=http

* We have enabled short circuit reads.

Right now, we have a relatively small block cache due to the 
requirements that the servers run other software.  We tried to find the 
best balance between block cache size, and RAM for programs, while still 
giving enough for local FS cache.  This came out to be 84 128M blocks - 
or about 10G for the cache per node (45 nodes total).

<directoryFactory name="DirectoryFactory"
         <bool name="solr.hdfs.blockcache.enabled">true</bool>
         <bool name="">true</bool>
         <int name="solr.hdfs.blockcache.slab.count">84</int>
         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
         <bool name="">true</bool>
         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
         <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
         <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>

Thanks for reviewing!


On 11/22/2017 8:20 AM, Kevin Risden wrote:
> Joe,
> I have a few questions about your Solr and HDFS setup that could help
> improve the recovery performance.
> * Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
> * Is Solr colocated with HDFS data nodes?
> * What is the output of "ps aux | grep solr"? (specifically looking for the
> Java arguments that are being set.)
> Depending on how Solr on HDFS was setup, there are some potentially simple
> settings that can help significantly improve performance.
> 1) Short circuit reads
> If Solr is colocated with an HDFS datanode, short circuit reads can improve
> read performance since it skips a network hop if the data is local to that
> node. This requires HDFS native libraries to be added to Solr.
> 2) HDFS block cache in Solr
> Solr without HDFS uses the OS page cache to handle caching data for
> queries. With HDFS, Solr has a special HDFS block cache which allows for
> caching HDFS blocks. This significantly helps query performance. There are
> a few configuration parameters that can help here.
> Kevin Risden
> On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp <>
> wrote:
>> Hi Joe,
>> sorry, I have not seen that problem. I would normally not delete a replica
>> if the shard is down but only if there is an active shard. Without an
>> active leader the replica should not be able to recover. I also just had a
>> case where all replicas of a shard stayed in down state and restarts didn't
>> help. This was however also caused by lock files. Once I cleaned them up
>> and restarted all Solr instances that had a replica they recovered.
>> For the lock files I discovered that the index is not always in the
>> "index" folder but can also be in an index.<timestamp> folder. There can be
>> an "" file in the "data" directory in HDFS and this
>> contains the correct index folder name.
>> If you are really desperate you could also delete all but one replica so
>> that the leader election is quite trivial. But this does of course increase
>> the risk of finally loosing the data quite a bit. So I would try looking
>> into the code and figure out what the problem is here and maybe compare the
>> state in HDFS and ZK with a shard that works.
>> regards,
>> Hendrik
>> On 21.11.2017 23:57, Joe Obernberger wrote:
>>> Hi Hendrick - the shards in question have three replicas.  I tried
>>> restarting each one (one by one) - no luck.  No leader is found. I deleted
>>> one of the replicas and added a new one, and the new one also shows as
>>> 'down'.  I also tried the FORCELEADER call, but that had no effect.  I
>>> checked the OVERSEERSTATUS, but there is nothing unusual there.  I don't
>>> see anything useful in the logs except the error:
>>> org.apache.solr.common.SolrException: Error getting leader from zk for
>>> shard shard21
>>>      at
>>> java:996)
>>>      at
>>>      at
>>>      at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(
>>>      at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolE
>>> xecutor.lambda$execute$0(
>>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>>      at java.util.concurrent.ThreadPoolExecutor$
>>>      at
>>> Caused by: org.apache.solr.common.SolrException: Could not get leader
>>> props
>>>      at
>>>      at
>>>      at
>>> java:963)
>>>      ... 7 more
>>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>>> KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.
>>> java:111)
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.
>>> java:51)
>>>      at org.apache.zookeeper.ZooKeeper.getData(
>>>      at$7.execute(SolrZkCl
>>>      at$7.execute(SolrZkCl
>>>      at
>>>      at
>>>      at
>>>      ... 9 more
>>> Can I modify zookeeper to force a leader?  Is there any other way to
>>> recover from this?  Thanks very much!
>>> -Joe
>>> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>>> We sometimes also have replicas not recovering. If one replica is left
>>>> active the easiest is to then to delete the replica and create a new one.
>>>> When all replicas are down it helps most of the time to restart one of the
>>>> nodes that contains a replica in down state. If that also doesn't get the
>>>> replica to recover I would check the logs of the node and also that of the
>>>> overseer node. I have seen the same issue on Solr using local storage. The
>>>> main HDFS related issues we had so far was those lock files and if you
>>>> delete and recreate collections/cores and it sometimes happens that the
>>>> data was not cleaned up in HDFS and then causes a conflict.
>>>> Hendrik
>>>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>>> We've never run an index this size in anything but HDFS, so I have no
>>>>> comparison.  What we've been doing is keeping two main collections -
>>>>> data, and the last 30 days of data.  Then we handle queries based on
>>>>> range. The 30 day index is significantly faster.
>>>>> My main concern right now is that 6 of the 100 shards are not coming
>>>>> back because of no leader.  I've never seen this error before.  Any ideas?
>>>>> ClusterStatus shows all three replicas with state 'down'.
>>>>> Thanks!
>>>>> -joe
>>>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>>>> We actually also have some performance issue with HDFS at the moment.
>>>>>> We are doing lots of soft commits for NRT search. Those seem to be
>>>>>> then with local storage. The investigation is however not really
far yet.
>>>>>> We have a setup with 2000 collections, with one shard each and a
>>>>>> replication factor of 2 or 3. When we restart nodes too fast that
>>>>>> problems with the overseer queue, which can lead to the queue getting
>>>>>> of control and Solr pretty much dying. We are still on Solr 6.3.
6.6 has
>>>>>> some improvements and should handle these actions faster. I would
>>>>>> what you see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json".
>>>>>> The critical part is the "overseer_queue_size" value. If this goes
up to
>>>>>> about 10000 it is pretty much game over on our setup. In that case
it seems
>>>>>> to be best to stop all nodes, clear the queue in ZK and then restart
>>>>>> nodes one by one with a gap of like 5min. That normally recovers
>>>>>> well.
>>>>>> regards,
>>>>>> Hendrik
>>>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>>>> We set the hard commit time long because we were having performance
>>>>>>> issues with HDFS, and thought that since the block size is 128M,
having a
>>>>>>> longer hard commit made sense.  That was our hypothesis anyway.
Happy to
>>>>>>> switch it back and see what happens.
>>>>>>> I don't know what caused the cluster to go into recovery in the
>>>>>>> place.  We had a server die over the weekend, but it's just one
out of
>>>>>>> ~50.  Every shard is 3x replicated (and 3x replicated in
>>>>>>> copies).  It was at this point that we noticed lots of network
>>>>>>> and most of the shards in this recovery, fail, retry loop.  That
is when we
>>>>>>> decided to shut it down resulting in zombie lock files.
>>>>>>> I tried using the FORCELEADER call, which completed, but doesn't
>>>>>>> to have any effect on the shards that have no leader. Kinda out
of ideas
>>>>>>> for that problem.  If I can get the cluster back up, I'll try
a lower hard
>>>>>>> commit time. Thanks again Erick!
>>>>>>> -Joe
>>>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to
>>>>>>>> ;)...
>>>>>>>> I need to back up a bit. Once nodes are in this state it's
>>>>>>>> surprising that they need to be forcefully killed. I was
>>>>>>>> thinking
>>>>>>>> about how they got in this situation in the first place.
_Before_ you
>>>>>>>> get into the nasty state how are the Solr nodes shut down?
>>>>>>>> Forcefully?
>>>>>>>> Your hard commit is far longer than it needs to be, resulting
in much
>>>>>>>> larger tlog files etc. I usually set this at 15-60 seconds
with local
>>>>>>>> disks, not quite sure whether longer intervals are helpful
on HDFS.
>>>>>>>> What this means is that you can spend up to 30 minutes when
>>>>>>>> restart solr _replaying the tlogs_! If Solr is killed, it
may not
>>>>>>>> have
>>>>>>>> had a chance to fsync the segments and may have to replay
on startup.
>>>>>>>> If you have openSearcher set to false, the hard commit operation
>>>>>>>> not horribly expensive, it just fsync's the current segments
>>>>>>>> opens
>>>>>>>> new ones. It won't be a total cure, but I bet reducing this
>>>>>>>> would help a lot.
>>>>>>>> Also, if you stop indexing there's no need to wait 30 minutes
if you
>>>>>>>> issue a manual commit, something like
>>>>>>>> .../collection/update?commit=true. Just reducing the hard
>>>>>>>> interval will make the wait between stopping indexing and
>>>>>>>> shorter all by itself if you don't want to issue the manual
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>>>>> <> wrote:
>>>>>>>>> Hi,
>>>>>>>>> the write.lock issue I see as well when Solr is not been
>>>>>>>>> gracefully.
>>>>>>>>> The write.lock files are then left in the HDFS as they
do not get
>>>>>>>>> removed
>>>>>>>>> automatically when the client disconnects like a ephemeral
node in
>>>>>>>>> ZooKeeper. Unfortunately Solr does also not realize that
it should
>>>>>>>>> be owning
>>>>>>>>> the lock as it is marked in the state stored in ZooKeeper
as the
>>>>>>>>> owner and
>>>>>>>>> is also not willing to retry, which is why you need to
restart the
>>>>>>>>> whole
>>>>>>>>> Solr instance after the cleanup. I added some logic to
my Solr
>>>>>>>>> start up
>>>>>>>>> script which scans the log files in HDFS and compares
that with the
>>>>>>>>> state in
>>>>>>>>> ZooKeeper and then delete all lock files that belong
to the node
>>>>>>>>> that I'm
>>>>>>>>> starting.
>>>>>>>>> regards,
>>>>>>>>> Hendrik
>>>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>>>> Hi All - we have a system with 45 physical boxes
running solr
>>>>>>>>>> 6.6.1 using
>>>>>>>>>> HDFS as the index.  The current index size is about
31TBytes. With
>>>>>>>>>> 3x
>>>>>>>>>> replication that takes up 93TBytes of disk. Our main
collection is
>>>>>>>>>> split
>>>>>>>>>> across 100 shards with 3 replicas each.  The issue
that we're
>>>>>>>>>> running into
>>>>>>>>>> is when restarting the solr6 cluster.  The shards
go into recovery
>>>>>>>>>> and start
>>>>>>>>>> to utilize nearly all of their network interfaces.
If we start too
>>>>>>>>>> many of
>>>>>>>>>> the nodes at once, the shards will go into a recovery,
fail, and
>>>>>>>>>> retry loop
>>>>>>>>>> and never come up.  The errors are related to HDFS
not responding
>>>>>>>>>> fast
>>>>>>>>>> enough and warnings from the DFSClient.  If we stop
a node when
>>>>>>>>>> this is
>>>>>>>>>> happening, the script will force a stop (180 second
timeout) and
>>>>>>>>>> upon
>>>>>>>>>> restart, we have lock files (write.lock) inside of
>>>>>>>>>> The process at this point is to start one node, find
out the lock
>>>>>>>>>> files,
>>>>>>>>>> wait for it to come up completely (hours), stop it,
delete the
>>>>>>>>>> write.lock
>>>>>>>>>> files, and restart.  Usually this second restart
is faster, but it
>>>>>>>>>> still can
>>>>>>>>>> take 20-60 minutes.
>>>>>>>>>> The smaller indexes recover much faster (less than
5 minutes).
>>>>>>>>>> Should we
>>>>>>>>>> have not used so many replicas with HDFS?  Is there
a better way
>>>>>>>>>> we should
>>>>>>>>>> have built the solr6 cluster?
>>>>>>>>>> Thank you for any insight!
>>>>>>>>>> -Joe
>>>>>>>>>> ---
>>>>>>>> This email has been checked for viruses by AVG.

View raw message