jakarta-jcs-dev mailing list archives

From Niall Gallagher <ni...@switchfire.com>
Subject JCS remote cache client shutdown behaviour
Date Mon, 07 Sep 2009 14:50:19 GMT
Hi,

I'm wondering if anyone can explain the sequence of steps the JCS client
code is supposed to follow when CompositeCacheManager.shutDown() is
called client-side? We are intermittently seeing high memory usage in
our JCS remote server, which appears to be caused by large backlogs of
event objects queued for delivery to client machines which have been
shut down, even though we are shutting down our client machines
gracefully using the method above. This is certainly aggravated by our
network's architecture, but I'm not sure whether the root cause is a bug
in JCS or whether I have misunderstood what is supposed to happen.

When we call CompositeCacheManager.shutDown() on a client machine, from
our client-side logs it appears that the dispose() method in this object
is getting called correctly for each cache region:
http://svn.apache.org/viewvc/jakarta/jcs/trunk/src/java/org/apache/jcs/auxiliary/remote/RemoteCacheListener.java?view=markup

However, that method appears only to unexport the RMI RemoteCacheListener
object for each region client-side, which simply terminates the
client-side end of the event delivery connection. Before disconnecting
though, shouldn't this method notify the server that the client is about
to disconnect?
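
As a rough illustration of the ordering being asked about here (tell the server first, then unexport), the sketch below uses plain java.rmi classes. ListenerRegistry, InMemoryRegistry and ClientListener are hypothetical stand-ins invented for this example; they are not the JCS interfaces, and this is not a claim about what JCS actually does on shutdown.

```java
import java.rmi.NoSuchObjectException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical callback interface the server uses to push events to a client.
interface EventListener extends Remote {
    void onEvent(String key) throws RemoteException;
}

// Hypothetical server-side API for adding and removing listeners.
interface ListenerRegistry extends Remote {
    void addListener(String region, EventListener l) throws RemoteException;
    void removeListener(String region, EventListener l) throws RemoteException;
}

// In-process stand-in for the server. In a real deployment the client
// would hold a remote stub for the registry instead.
class InMemoryRegistry implements ListenerRegistry {
    final Set<EventListener> listeners = ConcurrentHashMap.newKeySet();

    public void addListener(String region, EventListener l) { listeners.add(l); }

    public void removeListener(String region, EventListener l) { listeners.remove(l); }
}

class ClientListener implements EventListener {
    private final String region;
    private final ListenerRegistry server;

    private ClientListener(String region, ListenerRegistry server) {
        this.region = region;
        this.server = server;
    }

    // Export the listener on an anonymous port, then register it.
    static ClientListener register(String region, ListenerRegistry server)
            throws RemoteException {
        ClientListener l = new ClientListener(region, server);
        UnicastRemoteObject.exportObject(l, 0);
        server.addListener(region, l);
        return l;
    }

    public void onEvent(String key) {
        // A real client would update its local cache here.
    }

    void dispose() {
        // Step 1: ask the server to stop queueing events for this client.
        try {
            server.removeListener(region, this);
        } catch (RemoteException e) {
            // Server unreachable; carry on and tear down the local side anyway.
        }
        // Step 2: only now close the client-side end of the connection.
        try {
            UnicastRemoteObject.unexportObject(this, true);
        } catch (NoSuchObjectException e) {
            // Already unexported; nothing left to do.
        }
    }
}
```

With this ordering, a server that honours removeListener has no reason to keep queueing events for a client that has gone away.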

Subsequently we often see errors like this in the remote server log:


07-Sep 13:52:13,347 INFO  [jcs.engine.CacheEventQueue] Error while running event from Queue: RemoveEvent for [GAN: groupId=[groupId=<region name>, defaultGroup], attrName=<cache key>]. Retrying...
07-Sep 13:52:13,747 WARN  [jcs.engine.CacheEventQueue] java.rmi.ConnectException: Connection refused to host: <client machine ip address>; nested exception is:
        java.net.ConnectException: Connection refused
07-Sep 13:52:13,748 WARN  [jcs.engine.CacheEventQueue] Error while running event from Queue: RemoveEvent for [GAN: groupId=[groupId=<region name>, defaultGroup], attrName=<cache key>]. Dropping Event and marking Event Queue as non-functional.


...this implies the remote server continues to try to deliver events to
the JCS client which disconnected, as if the client didn't de-register
itself before disconnecting.

Perhaps I've missed something in the code.

I see that the RemoteCacheServer API (to which clients connect) does in
fact have a server-side dispose() method which (on initial
investigation) would "de-register" the client from the server's list of
event listeners. Could it be that JCS clients are simply not calling
this method?
http://svn.apache.org/viewvc/jakarta/jcs/trunk/src/java/org/apache/jcs/auxiliary/remote/server/RemoteCacheServer.java?view=markup


This issue is a problem for us depending on which network subnet the
client machine is in. Basically our network is divided into 2 subnets,
with a fairly rubbish (or overly-strict) router/firewall between the two
subnets. This router does not relay networking errors (ICMP error
messages) between the two subnets. When a machine in one subnet goes
offline and a machine in the other subnet tries to connect to it, our
router does not notify the source machine that the target machine is
offline, and so the source machine waits indefinitely (i.e. with a
socket in the open wait state) for a response from the target machine.
On the other hand when both machines are in the same subnet, the source
machine gets a "host not reachable" exception immediately when a target
machine is offline.
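
One possible mitigation for the indefinite wait, independent of JCS, is to give RMI client sockets an explicit connect timeout through a custom RMISocketFactory. The sketch below is only an illustration under that assumption; the class name TimeoutSocketFactory and the timeout value are made up for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Hypothetical factory: outbound RMI connections give up after a fixed
// timeout instead of waiting for an error that the router never relays.
class TimeoutSocketFactory extends RMISocketFactory {
    private final int connectTimeoutMillis;

    TimeoutSocketFactory(int connectTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Throws SocketTimeoutException if no connection is made in time.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
        } catch (IOException e) {
            socket.close();
            throw e;
        }
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }
}
```

It would be installed once per JVM, before any RMI calls are made, with RMISocketFactory.setSocketFactory(new TimeoutSocketFactory(2000)). This bounds only the connect phase; it does not by itself stop events from being queued for a dead client.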

Anyway... the problem is when we shut down a client machine in a
different subnet, the JCS remote server builds up a large backlog of
cache event objects, presumably trying to connect to a disconnected
client, and eventually runs out of memory. We determine this using the
JDK's jmap command - we find a large number of PutEvent and RemoveEvent
objects in the remote server's memory. We don't have the issue when both
machines are in the same subnet, but I wonder if that's because JCS
remote server is relying on the networking errors, and is de-registering
clients automatically after a certain number of failed attempts to
connect to the client. i.e. perhaps clients are not de-registering
themselves gracefully from the remote server in the first place.
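
To make the suspected server-side behaviour concrete, here is a minimal sketch of a per-client event queue that is capped in size and marks itself non-functional after a number of consecutive delivery failures. It is not the JCS CacheEventQueue; the class BoundedListenerQueue, its Delivery interface and the limits are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical per-client queue: bounded, and abandoned after repeated failures.
class BoundedListenerQueue {
    // Hypothetical delivery hook; a real server would make an RMI call here.
    interface Delivery {
        void send(String event) throws Exception;
    }

    private final Queue<String> pending = new ArrayDeque<>();
    private final Delivery delivery;
    private final int maxPending;
    private final int maxFailures;
    private int consecutiveFailures;
    private boolean functional = true;

    BoundedListenerQueue(Delivery delivery, int maxPending, int maxFailures) {
        this.delivery = delivery;
        this.maxPending = maxPending;
        this.maxFailures = maxFailures;
    }

    // Returns false instead of growing without bound.
    synchronized boolean offer(String event) {
        if (!functional || pending.size() >= maxPending) {
            return false;
        }
        pending.add(event);
        return true;
    }

    // Delivers queued events in order, retrying the head event on failure.
    // Holding the lock during send keeps the sketch simple; a real queue
    // would deliver on a separate worker thread.
    synchronized void drain() {
        while (functional && !pending.isEmpty()) {
            String event = pending.peek();
            try {
                delivery.send(event);
                pending.remove();
                consecutiveFailures = 0;
            } catch (Exception e) {
                consecutiveFailures++;
                if (consecutiveFailures >= maxFailures) {
                    // Give up on this client and release the backlog.
                    functional = false;
                    pending.clear();
                }
            }
        }
    }

    synchronized boolean isFunctional() { return functional; }

    synchronized int size() { return pending.size(); }
}
```

In a design like this the backlog is released only once deliveries actually fail. If each attempt blocks for a long time, as happens across our two subnets, the failure count is never reached quickly, which is why the cap on pending events matters as a second line of defence.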

Does anyone have any experience with this? Does anyone regularly see
"Error while running event from Queue" in the remote server logs? I
realise our network setup is partly to blame here, but perhaps the root
cause is that clients are not de-registering properly.

Many thanks in advance,

Niall
