hbase-user mailing list archives

From Thomas Downing <tdown...@proteus-technologies.com>
Subject Re: High ingest rate and FIN_WAIT1 problems
Date Tue, 20 Jul 2010 17:15:47 GMT
Yes, hadoop 0.20.2 and hbase 0.20.5.

I will get the branch you suggest, and give it a whirl.  I am leaving on
vacation Thursday, so I may not have any results to report till I get
back.

When I do get back, I will catch up with versions/fixes and try some
more.

Meanwhile, thanks to all who have responded to my posts.

thomas downing

On 7/20/2010 1:06 PM, Stack wrote:
> Hey Thomas:
>
> You are using hadoop 0.20.2 or something?  And hbase 0.20.5 or so?
>
> You might try http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/.
>   In particular, it has HDFS-1118 "Fix socket leak on DFSClient".
>
> St.Ack
>
> On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing
> <tdowning@proteus-technologies.com>  wrote:
>    
>> Yes, I did try the timeout of 0.  As expected, I did not see sockets
>> in FIN_WAIT2 or TIME_WAIT for very long.
>>
>> I still leak sockets at the ingest rates I need - the FIN_WAIT1
>> problem.  Also, with the more careful observations this time around,
>> I noted that even before the FIN_WAIT1 problem starts to crop
>> up (at around 1600M inserts) there is already a slower socket
>> leakage with timeout=0 and no FIN_WAIT1 problem.  At 100M
>> sockets were hovering around 50-60, by 800M they were around
>> 200, and at 1600M they were at 400.  This is slower than without
>> the timeout set to 0 (about half the rate), but it is still ultimately
>> fatal.
>>
>> This socket increase is all between hbase and hadoop, none
>> between test client and hbase.
>>
>> While the FIN_WAIT1 problem is triggered by an hbase side
>> issue, I have no indication of which side causes this other leak.
>>
>> thanks
>>
>> thomas downing
>>
>> On 7/19/2010 4:31 PM, Ryan Rawson wrote:
>>      
>>> Did you try the setting I suggested?  There is/was a known bug in HDFS
>>> which can cause issues which may include "abandoned" sockets such as
>>> you are describing.
>>>
>>> -ryan
>>>
>>> On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing
>>> <tdowning@proteus-technologies.com>    wrote:
>>>
>>>        
>>>> Thanks for the response, but my problem is not with FIN_WAIT2, it
>>>> is with FIN_WAIT1.
>>>>
>>>> If it was FIN_WAIT2, the only concern would be socket leakage,
>>>> and if  setting the time out solved the issue, that would be great.
>>>>
>>>> The problem with FIN_WAIT1 is twofold - first, it is incumbent on
>>>> the application to notice and handle this problem; from the TCP stack
>>>> point of view, there is nothing wrong.  It is just a special case of slow
>>>> consumer.  The other problem is that it implies that something will be
>>>> lost if the socket is abandoned, there is data in the send queue of the
>>>> socket in FIN_WAIT1 that has not yet been delivered to the peer.
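[Editor's note: the "data in the send queue" symptom described above is directly visible in netstat. A minimal sketch, assuming the Linux `netstat -tan` column layout (2=Recv-Q, 3=Send-Q, 5=foreign address, 6=state):]

```shell
# List FIN_WAIT1 sockets that still hold unacknowledged bytes in their
# send queue; abandoning these sockets would lose that data.
netstat -tan | awk '$6 == "FIN_WAIT1" && $3 > 0 { print $3 " bytes unacked to " $5 }'
```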
>>>>
>>>> On 7/16/2010 3:56 PM, Ryan Rawson wrote:
>>>>
>>>>> I've been running with this setting on both the HDFS side and the
>>>>> HBase side for over a year now, it's a bit of voodoo but you might be
>>>>> running into well known suckage of HDFS.  Try this one and restart
>>>>> your hbase & hdfs.
>>>>>
>>>>> The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so
>>>>> much for inserts.
>>>>>
>>>>> <property>
>>>>> <name>dfs.datanode.socket.write.timeout</name>
>>>>> <value>0</value>
>>>>> </property>
>>>>>
>>>>> -ryan
>>>>>
>>>>>
>>>>> On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing
>>>>> <tdowning@proteus-technologies.com>      wrote:
>>>>>
>>>>>> Thanks for the response.
>>>>>>
>>>>>> My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2,
>>>>>> my problem is with FIN_WAIT1.
>>>>>>
>>>>>> While I do see some sockets in TIME_WAIT, they are only a few, and
>>>>>> the number is not growing.
>>>>>>
>>>>>> On 7/16/2010 12:07 PM, Hegner, Travis wrote:
>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> I ran into a very similar issue when running slony-I on postgresql
>>>>>>> to replicate 15-20 databases.
>>>>>>>
>>>>>>> Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help
>>>>>>> to slow (or hopefully stop) the leaking sockets. I found some
>>>>>>> notes about adjusting TCP parameters here:
>>>>>>> http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html
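[Editor's note: as clarified later in the thread, this kernel knob governs how long orphaned sockets linger in FIN_WAIT2, not FIN_WAIT1. For reference, it can be inspected and lowered with sysctl; the value 30 below is purely illustrative.]

```shell
# Read the current FIN_WAIT2 timeout in seconds (Linux default: 60).
sysctl net.ipv4.tcp_fin_timeout
# Lower it so orphaned FIN_WAIT2 sockets are reclaimed sooner.
# Illustrative value; this has no effect on sockets stuck in FIN_WAIT1.
sysctl -w net.ipv4.tcp_fin_timeout=30
```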
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>> [snip]
>>>>

