thrift-user mailing list archives

From "Roger Meier" <ro...@bufferoverflow.ch>
Subject RE: non-blocking servers are leaking sockets
Date Thu, 23 Jan 2014 22:45:04 GMT
Hi Jules

Are you able to provide a test case on this?

Did you test with a try-catch around the socketChannel_.read() call at
TNonblockingSocket.java line 141?
You can catch a NotYetConnectedException there and do proper error handling.
http://openjdk.java.net/projects/nio/javadoc/java/nio/channels/SocketChannel.html
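
Roughly what I have in mind, only a sketch: the read() signature and the
socketChannel_ field come from the source file linked below in my earlier
mail, and re-throwing as IOException so the server's existing error path
kicks in is just one possible way to handle it:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.NotYetConnectedException;

    // sketch of a guard around the read at TNonblockingSocket.java line 141
    public int read(ByteBuffer buffer) throws IOException {
      try {
        return socketChannel_.read(buffer);
      } catch (NotYetConnectedException e) {
        // re-throw as IOException so the server's existing handling
        // (log the failure and close the FrameBuffer) can run
        throw new IOException("channel not yet connected", e);
      }
    }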

Could you try this?

-roger

-----Original Message-----
From: Jules Cisek [mailto:jules@luminate.com]
Sent: Wednesday, 22 January 2014 23:33
To: user@thrift.apache.org
Subject: Re: non-blocking servers are leaking sockets

On Wed, Jan 22, 2014 at 2:01 PM, Roger Meier <roger@bufferoverflow.ch> wrote:

> You need to catch the IOException that was thrown by 
> TNonblockingSocket during read within your application, see here:
>
> https://git-wip-us.apache.org/repos/asf/thrift/repo?p=thrift.git;a=blob;f=lib/java/src/org/apache/thrift/transport/TNonblockingSocket.java;h=482bd149ab0a993e90315e4f719d0903c89ac1f0;hb=HEAD#l140
>
> The Thrift library does not know what to do about network issues or
> similar problems that can cause a read to fail within your environment.
>

i don't know how i can catch the exception on the server side since the
error is thrown outside any code path i have control over.  the try/catch
blocks i have in my remote methods never see these network/timeout errors.

the clients are too numerous and in too many different languages to fix it
on their side.  and i submit that the server should be able to recover from
a misbehaving client.

~j


>
> ;-r
>
> -----Original Message-----
> From: Jules Cisek [mailto:jules@luminate.com]
> Sent: Wednesday, 22 January 2014 21:10
> To: user@thrift.apache.org
> Subject: Re: non-blocking servers are leaking sockets
>
> this service actually needs to respond in under 100ms (and usually 
> does in less than 20) so a short delay is just not possible.
>
> on the server, i see a lot of this in the logs:
>
> 14/01/22 19:15:27 WARN Thread-3 server.TThreadedSelectorServer: Got an IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.select(TThreadedSelectorServer.java:576)
>         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.run(TThreadedSelectorServer.java:536)
>
> (note that these resets happen when the async client doesn't get a 
> response from the server in the time set using client.setTimeout(m) 
> which in our case can be quite often and we're ok with that)
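
[For reference, an async client with that kind of per-call deadline is
typically set up along these lines; MyService stands in for the generated
service class, and the host, port and timeout below are example values only,
with constructor exceptions left unhandled in this fragment:

    import org.apache.thrift.async.TAsyncClientManager;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TNonblockingSocket;

    // MyService is a placeholder for the generated Thrift service class
    TAsyncClientManager manager = new TAsyncClientManager();
    TNonblockingSocket transport = new TNonblockingSocket("thrift-server.example.com", 9090);
    MyService.AsyncClient client =
        new MyService.AsyncClient(new TBinaryProtocol.Factory(), manager, transport);

    // the per-call deadline; when it fires the client gives up, and the
    // abandoned connection shows up server-side as "Connection reset by peer"
    client.setTimeout(100);
]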
>
> i'm not sure why the thrift library feels it's necessary to log this
> stuff, since clients drop connections all the time and should be expected
> to.  frankly, it makes me think that somehow this common error is not
> being properly handled (although looking through the code it does look
> like the SocketChannel eventually gets close()'ed)
>
> ~j
>
>
>
> On Mon, Jan 20, 2014 at 12:15 PM, Sammons, Mark <mssammon@illinois.edu> wrote:
>
> > Hi, Jules.
> >
> > I'm not sure my problems are completely analogous to yours, but I 
> > had a situation where a client program making many short calls to a 
> > remote thrift server was getting a "no route to host" exception 
> > after some number of calls, and it appeared to be due to slow 
> > release of closed sockets.  I found that adding a short (20ms) delay 
> > between calls resolved the problem.
> >
> > I realize this is not exactly a solution, but it has at least 
> > allowed me to keep working...
> >
> > Regards,
> >
> > Mark
> >
> > ________________________________________
> > From: Jules Cisek [jules@luminate.com]
> > Sent: Monday, January 20, 2014 12:39 PM
> > To: user@thrift.apache.org
> > Subject: non-blocking servers are leaking sockets
> >
> > i'm running java TThreadedSelectorServer and THsHaServer based 
> > servers and both seem to be leaking sockets (thrift 0.9.0)
> >
> > googling around for answers i keep running into
> > https://issues.apache.org/jira/browse/THRIFT-1653, which puts the
> > blame on the TCP config on the server while acknowledging that a
> > problem in the application layer may also exist (see the last entry)
> >
> > i prefer not to mess with the TCP config on the machine because it is
> > used for various tasks; also, i did not have these issues with a
> > TThreadPoolServer and a TSocket (blocking + TBufferedTransport) or
> > any non-thrift server on the same machine.
> >
> > what happens is i get a bunch of TCP connections in a CLOSE_WAIT 
> > state and these remain in that state indefinitely.  but what is even 
> > more concerning, i get many sockets that don't show up in netstat at 
> > all and only lsof can show me that they exist.  on Linux lsof shows 
> > them as "can't identify protocol".  according to 
> > https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
> > these sockets are in a "half closed state" and the linux kernel has 
> > no idea what to do with them.
> >
> > i'm pretty sure there's a problem with misbehaving clients, but the
> > server should not leak resources because of a client-side bug.
> >
> > my only recourse is to run a cronjob that looks at the lsof output 
> > and restarts the server whenever the socket count gets dangerously 
> > close to "too many open files" (8192 in my case)
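
[A possible in-process alternative to that lsof cron job, assuming a HotSpot
JVM on Unix where the OperatingSystemMXBean can be cast to
com.sun.management.UnixOperatingSystemMXBean; the 90% threshold and the
60-second poll interval are arbitrary examples, and this only watches the
descriptor count, it does not fix the leak:

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // run inside the server JVM as a daemon thread; warns when the process
    // gets close to its open-file limit instead of shelling out to lsof
    Thread fdWatcher = new Thread(new Runnable() {
      public void run() {
        UnixOperatingSystemMXBean os =
            (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
          long open = os.getOpenFileDescriptorCount();
          long max = os.getMaxFileDescriptorCount();
          if (open > max * 0.9) {
            System.err.println("fd usage " + open + "/" + max
                + " -- restart or shed load before hitting the ulimit");
          }
          try {
            Thread.sleep(60000);
          } catch (InterruptedException e) {
            return;
          }
        }
      }
    });
    fdWatcher.setDaemon(true);
    fdWatcher.start();
]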
> >
> > any ideas?
> >
> > --
> > jules cisek | jules@luminate.com
> >
>
>
>
> --
> jules cisek | jules@luminate.com
>
>


--
jules cisek | jules@luminate.com

