mina-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Emmanuel Lecharny (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DIRMINA-1107) SslHandler flushScheduledEvents race condition, redux
Date Fri, 17 May 2019 14:43:00 GMT

    [ https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842227#comment-16842227
] 

Emmanuel Lecharny commented on DIRMINA-1107:
--------------------------------------------

Continuing my archeocodology, here are the situations where it's possible to write a message
in the IoHandler while the handshake has not been completed :
o for a server: just when the session is created, and if the {{SSLFilter}} is added to the
session's chain, and if the handshake is started immediately, then the server will expect
the next received message to be a {{CLIENT_HELLO}}. However, after having injected the filter
in the chain, and created the {{SslHandler}} instance, and having started the handshake in
server mode, then a {{sessionCreated}} followed by a {{sessionOpened}} events are called,
which potentially gives the application to call a {{session.write(message)}}. Obviously, this
message should *not* go through the {{SSLFilter}} before the handshake has been completed,
it must be enqueued, waiting for the handshake to complete (or fail).
o If the handshake takes a while, and if the session idling detection is activated (every
second), the it's possible that the {{IoHandler}} gets called through the {{sessionIdling}}
event, giving the opportunity for the application to write a message (ok, that is a remote
possibility)

Just writing that down for the sake of documentation...

> SslHandler flushScheduledEvents race condition, redux
> -----------------------------------------------------
>
>                 Key: DIRMINA-1107
>                 URL: https://issues.apache.org/jira/browse/DIRMINA-1107
>             Project: MINA
>          Issue Type: Bug
>    Affects Versions: 2.1.2
>            Reporter: Guus der Kinderen
>            Priority: Major
>             Fix For: 2.1.3
>
>
> DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally replaces it
with another multithreading issue.
> The fix for DIRMINA-1019 introduces a counter that contains the number of events to be
processed. A simplified version of the code is included below.
> {code:java}
> private final AtomicInteger scheduledEvents = new AtomicInteger(0);
> void flushScheduledEvents() {
>     scheduledEvents.incrementAndGet();
>     if (sslLock.tryLock()) {            
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>             
>                 while ((event = messageReceivedEventQueue.poll()) != null){
>                     // ...
>                 }
>             } while (scheduledEvents.decrementAndGet() > 0);
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }{code}
> We have observed occasions where the value of {{scheduledEvents}} becomes a negative
value, while at the same time {{filterWriteEventQueue}} go unprocessed.
> We suspect that this issue is triggered by a concurrency issue caused by the first thread
decrementing the counter after a second thread incremented it, but before it attempted to
acquire the lock.
> This allows the the first thread to empty the queues, decrementing the counter to zero
and release the lock, after which the second thread acquires the lock successfully. Now, the
second thread processes any elements in {{filterWriteEventQueue}}, and then processes any
elements in {{messageReceivedEventQueue}}. If in between these two checks yet another thread
adds a new element to {{filterWriteEventQueue}}, this element can go unprocessed (as the second
thread does not loop, since the counter is zero or negative, and the third thread can fail
to acquire the lock).
> It's a seemingly unlikely scenario, but we are observing the behavior when our systems
are under high load.
> We've applied a code change after which this problem is no longer observed. We've removed
the counter, and check on the size of the queues instead:
> {code:java}
> void flushScheduledEvents() {
>     if (sslLock.tryLock()) {            
>         try {
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>             
>                 while ((event = messageReceivedEventQueue.poll()) != null){
>                     // ...
>                 }
>             } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty());
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }{code}
> This code change, as illustrated above, does introduce a new potential problem. Theoretically,
an event could be added to the queues and {{flushScheduledEvents}} be called returning {{false}}
for {{sslLock.tryLock()}}, exactly after another thread just finished the {{while}} loop,
but before releasing the lock. This again would cause events to go unprocessed.
> We've not observed this problem in the wild yet, but we're uncomfortable applying this
change as-is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message