thrift-user mailing list archives

From Jules Cisek <ju...@luminate.com>
Subject Re: non-blocking servers are leaking sockets
Date Fri, 24 Jan 2014 00:28:05 GMT
thanks roger,

i will try and build a repro case and then patch the socket to see if it
helps.

by the way,  i got rid of the client side timeout on our busiest client and
this seems to have helped.  there are a lot fewer of those half closed sockets
now (dozens instead of thousands), giving further credence to the theory that
the server side is leaking sockets when the client goes away unexpectedly.

this bandaid is not going to scale for very long, however, since the clients
back up without a timeout (i've implemented my own timeouts using
CountDownLatch, roughly as in the sketch below, but again, that just moves
the problem elsewhere).
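a minimal sketch of what i mean, assuming a generated async client for a
hypothetical FooService with a single getThing() method returning a string
(the names are illustrative, not our real service):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.thrift.async.AsyncMethodCallback;

    // wrap one async call and give up after 100ms; the call (and its
    // socket) may still be in flight on the server after the timeout
    static String callWithTimeout(final FooService.AsyncClient client)
        throws Exception {
      final CountDownLatch latch = new CountDownLatch(1);
      final AtomicReference<String> result = new AtomicReference<String>();

      client.getThing(new AsyncMethodCallback<FooService.AsyncClient.getThing_call>() {
        public void onComplete(FooService.AsyncClient.getThing_call call) {
          try {
            result.set(call.getResult());
          } catch (Exception e) {
            // leave result null; treat it as a failed call
          }
          latch.countDown();
        }
        public void onError(Exception e) {
          latch.countDown();
        }
      });

      // this is the part that just moves the problem elsewhere
      if (!latch.await(100, TimeUnit.MILLISECONDS)) {
        return null; // timed out
      }
      return result.get();
    }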

~j




On Thu, Jan 23, 2014 at 2:45 PM, Roger Meier <roger@bufferoverflow.ch> wrote:

> Hi Jules
>
> Are you able to provide a test case on this?
>
> Did you test with a try-catch for the socketChannel_.read() within
> TNonblockingSocket.java line 141?
> You can catch a NotYetConnectedException there and do proper error
> handling.
>
> http://openjdk.java.net/projects/nio/javadoc/java/nio/channels/SocketChannel.html
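>
> Something like this, just as a sketch; whether returning -1 (end of
> stream) is the right recovery is part of what the test should show:
>
>     import java.io.IOException;
>     import java.nio.ByteBuffer;
>     import java.nio.channels.NotYetConnectedException;
>
>     // sketch of the idea, applied to TNonblockingSocket.read()
>     public int read(ByteBuffer buffer) throws IOException {
>       try {
>         return socketChannel_.read(buffer);
>       } catch (NotYetConnectedException e) {
>         // channel is not connected (anymore); report end of stream
>         // so the server closes and releases the connection
>         return -1;
>       }
>     }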
>
> Could you try this?
>
> -roger
>
> -----Original Message-----
> From: Jules Cisek [mailto:jules@luminate.com]
> Sent: Wednesday, 22 January 2014 23:33
> To: user@thrift.apache.org
> Subject: Re: non-blocking servers are leaking sockets
>
> On Wed, Jan 22, 2014 at 2:01 PM, Roger Meier <roger@bufferoverflow.ch>
> wrote:
>
> > You need to catch the IOException that was thrown by
> > TNonblockingSocket during read within your application, see here:
> >
> > https://git-wip-us.apache.org/repos/asf/thrift/repo?p=thrift.git;a=blob;f=lib/java/src/org/apache/thrift/transport/TNonblockingSocket.java;h=482bd149ab0a993e90315e4f719d0903c89ac1f0;hb=HEAD#l140
> >
> > The Thrift library does not know what to do about network issues or
> > similar problems that can cause a read to fail within your environment.
> >
>
> i don't know how i can catch the exception on the server side since the
> error is thrown outside any code path i have control over.  the try/catch
> blocks i have in my remote methods never see these network/timeout errors.
>
> the clients are too numerous and written in too many different languages to
> fix this on their side.  and i submit that the server should be able to
> recover from a misbehaving client.
>
> ~j
>
>
> >
> > ;-r
> >
> > -----Original Message-----
> > From: Jules Cisek [mailto:jules@luminate.com]
> > Sent: Wednesday, 22 January 2014 21:10
> > To: user@thrift.apache.org
> > Subject: Re: non-blocking servers are leaking sockets
> >
> > this service actually needs to respond in under 100ms (and usually
> > does in less than 20) so a short delay is just not possible.
> >
> > on the server, i see a lot of this in the logs:
> >
> > 14/01/22 19:15:27 WARN Thread-3 server.TThreadedSelectorServer: Got an IOException in internalRead!
> > java.io.IOException: Connection reset by peer
> >         at sun.nio.ch.FileDispatcher.read0(Native Method)
> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> >         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> >         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
> >         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
> >         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:305)
> >         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
> >         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.select(TThreadedSelectorServer.java:576)
> >         at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.run(TThreadedSelectorServer.java:536)
> >
> > (note that these resets happen when the async client doesn't get a
> > response from the server within the time set using client.setTimeout(m),
> > which in our case can be quite often, and we're ok with that)
> >
> > i'm not sure why the thrift library feels it's necessary to log this
> > stuff, since clients drop connections all the time and should be expected
> > to.  frankly, it makes me think that somehow this common error is not
> > being properly handled (although looking through the code it does look
> > like the SocketChannel eventually gets close()'ed).
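> >
> > for reference, the read handling in AbstractNonblockingServer looks
> > roughly like this (paraphrased from the 0.9.0 source, not an exact copy):
> >
> >     // paraphrase of AbstractNonblockingServer$FrameBuffer.internalRead();
> >     // this is where the warning above comes from
> >     private boolean internalRead() {
> >       try {
> >         if (trans_.read(buffer_) < 0) {
> >           return false; // peer closed cleanly
> >         }
> >         return true;
> >       } catch (IOException e) {
> >         LOGGER.warn("Got an IOException in internalRead!", e);
> >         return false; // caller is then expected to close the connection
> >       }
> >     }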
> >
> > ~j
> >
> >
> >
> > On Mon, Jan 20, 2014 at 12:15 PM, Sammons, Mark
> > <mssammon@illinois.edu> wrote:
> >
> > > Hi, Jules.
> > >
> > > I'm not sure my problems are completely analogous to yours, but I
> > > had a situation where a client program making many short calls to a
> > > remote thrift server was getting a "no route to host" exception
> > > after some number of calls, and it appeared to be due to slow
> > > release of closed sockets.  I found that adding a short (20ms) delay
> > > between calls resolved the problem.
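> > >
> > > Schematically it is just this (Request and process() are placeholders
> > > for my actual call, not a real API):
> > >
> > >     for (Request r : requests) {
> > >       client.process(r);  // short blocking thrift call
> > >       Thread.sleep(20);   // 20ms pause lets closed sockets drain
> > >     }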
> > >
> > > I realize this is not exactly a solution, but it has at least
> > > allowed me to keep working...
> > >
> > > Regards,
> > >
> > > Mark
> > >
> > > ________________________________________
> > > From: Jules Cisek [jules@luminate.com]
> > > Sent: Monday, January 20, 2014 12:39 PM
> > > To: user@thrift.apache.org
> > > Subject: non-blocking servers are leaking sockets
> > >
> > > i'm running java TThreadedSelectorServer and THsHaServer based
> > > servers and both seem to be leaking sockets (thrift 0.9.0)
> > >
> > > googling around for answers i keep running into
> > > https://issues.apache.org/jira/browse/THRIFT-1653 which puts the blame
> > > on the server's TCP config while acknowledging that a problem in the
> > > application layer may well exist (see the last entry)
> > >
> > > i prefer not to mess with the TCP config on the machine because it is
> > > used for various tasks.  also, i did not have these issues with a
> > > TThreadPoolServer and a TSocket (blocking + TBufferedTransport) or with
> > > any non-thrift server on the same machine.
> > >
> > > what happens is i get a bunch of TCP connections in a CLOSE_WAIT
> > > state, and these remain in that state indefinitely.  but what is even
> > > more concerning, i get many sockets that don't show up in netstat at
> > > all; only lsof can show me that they exist.  on Linux lsof shows them
> > > as "can't identify protocol".  according to
> > > https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
> > > these sockets are in a "half closed" state and the linux kernel has
> > > no idea what to do with them.
> > >
> > > i'm pretty sure there's a problem with misbehaving clients, but the
> > > server should not leak resources because of a client side bug.
> > >
> > > my only recourse is to run a cronjob that looks at the lsof output
> > > and restarts the server whenever the socket count gets dangerously
> > > close to the "too many open files" limit (8192 in my case)
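> > >
> > > (for the record, the same check can also be done in-process on linux
> > > by counting entries in /proc/self/fd instead of parsing lsof output.
> > > just a sketch; the 8192 comes from our ulimit:)
> > >
> > >     import java.io.File;
> > >
> > >     // rough watchdog check, run periodically inside the server
> > >     static boolean nearFdLimit() {
> > >       String[] fds = new File("/proc/self/fd").list();
> > >       return fds != null && fds.length > 8192 * 0.9;
> > >     }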
> > >
> > > any ideas?
> > >
> > > --
> > > jules cisek | jules@luminate.com
> > >
> >
> >
> >
> > --
> > jules cisek | jules@luminate.com
> >
> >
>
>
> --
> jules cisek | jules@luminate.com
>
>


-- 
jules cisek | jules@luminate.com
