mina-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Emmanuel Lecharny (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (DIRMINA-753) Some writes observed to stall until select times out
Date Wed, 13 Jan 2010 00:18:54 GMT

     [ https://issues.apache.org/jira/browse/DIRMINA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Emmanuel Lecharny resolved DIRMINA-753.
---------------------------------------

    Resolution: Fixed

Applied the proposed fix, seems to be solid. May be not perfect, but in any case, better than
what we currently have !

Many thanks John, brillant analysis !

http://svn.apache.org/viewvc?rev=898583&view=rev

> Some writes observed to stall until select times out
> ----------------------------------------------------
>
>                 Key: DIRMINA-753
>                 URL: https://issues.apache.org/jira/browse/DIRMINA-753
>             Project: MINA
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 2.0.0-RC1
>         Environment: java version "1.6.0_16"
> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>            Reporter: John R. Fallows
>            Assignee: Emmanuel Lecharny
>             Fix For: 2.0.0-RC2
>
>
> We have observed an issue in Mina 2.0.0-RC1 where TCP writes are not flushed to the network
even though the receiver is keeping up and CPU has low utilization.
> In the test scenario, all of the clients initially connect to the server, then the server
starts writing messages to each of the clients.  The clients connect via localhost to eliminate
clock drift and physical network impact, and we measure the observable latency of the message
between sender and receiver.
> Note that the clients do not send any data to the server after messages start flowing
from the server, just ack traffic.
> When the problem occurs, some messages are unexpectedly delayed by approximately 1000ms.
> By observing a tcpdump of the wire traffic during the test, we verified that the receiver
was keeping up with message delivery, but that message delivery would sometimes stall for
approximately 1000ms after which the delayed messages would be written to the wire.
> By re-running the same test with a locally modified version of AbstractPollingIoProcessor,
such that the select() timeout was 750ms
> instead of 1000ms, we found that the messages were delayed by approximately 750ms instead.
> Further inspection of AbstractPollingIoProcessor.flush(T session) suggests a potential
race condition that would explain this behavior:
>    public final void flush(T session) {
>        boolean needsWakeup = flushingSessions.isEmpty();
>        if (scheduleFlush(session) && needsWakeup) {
>            wakeup();
>        }
>    }
> For the purposes of this description, let's say AbstractPollingIoProcessor.flush(T session)
is called by the [writer]
> thread, while Selector.select(long timeout) is called by the [processor] thread.
> [writer]       flushingSessions.isEmpty() -> false (last flushing session being processed)
> [processor] flush(currentTime), flushingSessions.poll() empties flushingSessions, completely
write last flushing session, write request queue empty, no subsequent flush scheduled
> [writer]       scheduleFlush(session) returns true, flushingSessions now non-empty, requires
wakeup(), but not called (!)
> [processor] stalls until select timeout expires, then flushes
> It seems that the intent of needsWakeup is to avoid calling wakeup() unnecessarily by
tracking when flushingSessions becomes non-empty, but it does not seem to capture the edge
case where flushSessions starts out non-empty, then becomes empty, then becomes non-empty
again, triggering the stall.
> We found that the following modification to AbstractPollingIoProcessor.flush(T session)
addressed the issue without adverse effects.
>    public final void flush(T session) {
>        if (scheduleFlush(session)) {
>            wakeup();
>        }
>    }
> The difference in behavior is as follows:
>  1. wakeup() could be called many times between calls to select()
>  2. select() might unblock one extra time when the selector is otherwise idle
> Case (1) is documented by Selector.select to be equivalent to a single call to wakeup()
between calls to select().  The assumption is that subsequent calls to wakeup() are cheap
because it is latched internally.
> Case (2) can occur as follows:
> [writer]       scheduleFlush(session) returns true
> [processor] processes all flushingSessions, including most recent addition by [writer]
> [writer]       calls wakeup()
> [processor] immediately unblocks next call to select() due to wakeup() call by [writer]
> [processor] blocks again on next call to select()
> When the [processor] unblocks select() due to wakeup() call by [writer], this is not
an _extra_ unblock when there is other work to
> do in the [processor] loop, so any additional overhead introduced by eliminating the
needsWakeup check is only observed when the selector
> is otherwise idle, and is therefore considered harmless.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message