drill-dev mailing list archives

From "Sorabh Hamirwasia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-6746) Query can hang when PartitionSender task thread sees a connection failure while sending data batches to remote fragment
Date Wed, 19 Sep 2018 18:39:00 GMT
Sorabh Hamirwasia created DRILL-6746:
----------------------------------------

             Summary: Query can hang when PartitionSender task thread sees a connection failure
while sending data batches to remote fragment
                 Key: DRILL-6746
                 URL: https://issues.apache.org/jira/browse/DRILL-6746
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Flow
    Affects Versions: 1.13.0
            Reporter: Sorabh Hamirwasia
            Assignee: Sorabh Hamirwasia
             Fix For: 1.15.0


An UnorderedMuxExchange is implemented using an UnorderedReceiver and a HashPartitionSender. The muxer is used to improve memory usage: when multiple minor fragments (say n) running on a node send data to minor fragments on multiple other remote nodes (say m), each sending fragment has to create m buffers, one per receiver, which means creating n*m buffers in total on a single node. With the muxer, the data from all n local minor fragments is instead sent to one local minor fragment, and that local minor fragment sends the data on to the m remote fragments/nodes. Hence only m buffers are needed in total.
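
As a toy illustration of the buffer arithmetic (the values of n and m here are hypothetical, just to make the savings concrete):
{code:java}
// Toy calculation of the buffer counts described above (hypothetical n and m).
public class MuxBufferMath {
  public static void main(String[] args) {
    int n = 10; // minor fragments sending from this node
    int m = 20; // remote receiving minor fragments
    System.out.println("without mux: " + (n * m) + " buffers"); // 200
    System.out.println("with mux:    " + m + " buffers");       // 20
  }
}
{code}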

There is a shared queue which is filled with RecordBatches by all the n sending minor fragments (the batches arrive on the data client channel and are ultimately enqueued by a netty thread), and the queue is consumed, one batch per next() call, by the local minor fragment that runs the PartitionSender --> UnorderedReceiver pair. Hence the queue is filled and consumed by different threads. When the PartitionSender receives an incoming batch, then based on some heuristics it creates multiple PartitionTask threads, each of which walks over the incoming batch and copies the rows that fall in its range into its own outgoing batch. The main local minor fragment thread waits until all task threads have completed, or until it is interrupted, and then calls next() for the next incoming batch. In this process, once an outgoing batch is full it is sent to the remote node; all sends are done asynchronously.
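
A minimal sketch of this fan-out (the class names and pool size are hypothetical stand-ins, not Drill's actual code): the fragment thread hands one incoming batch to several partition tasks and waits for all of them before calling next() again.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the PartitionSender fan-out described above.
class PartitionFanOutSketch {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  void processIncoming(Object incomingBatch, int numTasks) throws InterruptedException {
    CountDownLatch done = new CountDownLatch(numTasks);
    for (int i = 0; i < numTasks; i++) {
      final int range = i;
      pool.submit(() -> {
        try {
          // each task copies the rows of 'incomingBatch' whose partition falls
          // in 'range' into its own outgoing batch, sending asynchronously when full
        } finally {
          done.countDown();
        }
      });
    }
    done.await(); // the main fragment thread blocks here until every task finishes
  }
}
{code}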

In this setup, if a task thread hits a failure while sending its outgoing batch, the executor state of the main local fragment thread (running the PartitionSender and UnorderedReceiver) is set to FAILED asynchronously. Meanwhile the main thread makes a next() call to get a new incoming batch. There is a race condition between the executor-state check that precedes the next() call and the moment the FAILED state is set, so next() can be called before the state is actually updated. With this next() call, if there is no RecordBatch present in the queue, the main local fragment thread calls take() on the buffer queue and waits until it gets a new batch. Meanwhile the executor state gets updated, and the netty thread that receives the next batch and tries to enqueue it sees the updated state and releases the received batch without putting it in the shared queue. Since no new batch will be stored in the shared queue from then on, the main local minor fragment thread is stuck forever unless a cancellation is explicitly issued, which interrupts the stuck thread. This can result in a query hang.
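
A condensed sketch of the race (all class and method names here are hypothetical stand-ins, not Drill's actual code):
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class MuxQueueRaceSketch {
  private final LinkedBlockingQueue<Object> bufferQueue = new LinkedBlockingQueue<>();
  private final AtomicBoolean failed = new AtomicBoolean(false);

  // Fragment (consumer) thread: checks state, then blocks on take().
  Object next() throws InterruptedException {
    if (failed.get()) {          // (1) state can still read RUNNING here...
      return null;               // ...which would normally trigger cleanup
    }
    // (2) ...but FAILED may be set right now, between the check and take()
    return bufferQueue.take();   // (3) blocks forever: nothing is enqueued again
  }

  // Netty (producer) thread: after FAILED it releases the batch instead of enqueueing it.
  void enqueue(Object batch) {
    if (failed.get()) {
      // batch is released and never reaches the queue, so the consumer never wakes up
      return;
    }
    bufferQueue.offer(batch);
  }

  // Status-handler thread: flips the state asynchronously on a send failure.
  void onSendFailure() {
    failed.set(true);
  }
}
{code}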

*Logs for the above investigation:*

It looks like the intermediate fragment 2:3 started executing when it received the record batch.
{code:java}
drill-cluster/drillbit.1.log:2018-08-08 00:27:34,210 [249580bc-5bca-b166-e906-084b35ecf30d:frag:2:3]
INFO  o.a.d.e.w.fragment.FragmentExecutor - 249580bc-5bca-b166-e906-084b35ecf30d:2:3: State
change requested AWAITING_ALLOCATION --> RUNNING

drill-cluster/drillbit.1.log:2018-08-08 00:27:34,210 [249580bc-5bca-b166-e906-084b35ecf30d:frag:2:3]
INFO  o.a.d.e.w.f.FragmentStatusReporter - 249580bc-5bca-b166-e906-084b35ecf30d:2:3: State
to report: RUNNING
{code}
But later, while sending a record batch downstream to a remote node, it hit a failure: the data connection could not be established because the remote Drillbit was not running at the time.
{code:java}
2018-08-08 00:33:29,184 [BitClient-7] ERROR o.a.d.e.rpc.ConnectionMultiListener - Failed to
establish connection
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException:
Connection refused: drillbit2/10.10.10.10:31012
        at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54) ~[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.operationComplete(ConnectionMultiListener.java:90)
[drill-rpc-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.operationComplete(ConnectionMultiListener.java:77)
[drill-rpc-1.13.0.jar:1.13.0]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) [netty-common-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:278)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:294)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) [netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) [netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
[netty-common-4.0.48.Final.jar:4.0.48.Final]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_144]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused:
drillbit2/10.10.10.10:31012
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_144]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[na:1.8.0_144]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:258)
~[netty-transport-4.0.48.Final.jar:4.0.48.Final]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
[netty-transport-4.0.48.Final.jar:4.0.48.Final]
        ... 6 common frames omitted
Caused by: java.net.ConnectException: Connection refused
        ... 10 common frames omitted

{code}
Then, due to this failure while *sendingRecordBatch*, the status handler sets the FragmentExecutor state to FAILED, but cleanup is not performed at that point, since cleanup happens only when the main next() loop of the operators terminates.
{code:java}
2018-08-08 00:33:29,184 [BitClient-7] INFO  o.a.d.e.w.fragment.FragmentExecutor - 249580bc-5bca-b166-e906-084b35ecf30d:2:3:
State change requested RUNNING --> FAILED
{code}
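In outline, the asynchronous failure path looks like the following sketch (hypothetical names, not the actual status-handler code): the handler only flips the state; the teardown itself is left to the fragment thread's next shouldContinue() check.
{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the status-handler behavior described above.
class StatusHandlerSketch {
  enum State { RUNNING, FAILED, FINISHED }

  private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

  // Invoked on a netty/RPC thread when an asynchronous send fails.
  void onFailure(Throwable cause) {
    state.compareAndSet(State.RUNNING, State.FAILED); // state change only
    // No cleanup here: the fragment thread is expected to observe FAILED on its
    // next shouldContinue() check and tear itself down from the operator loop.
  }

  boolean shouldContinue() {
    return state.get() == State.RUNNING;
  }
}
{code}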
Ideally the operators should have seen that the executor state is FAILED and should have started the cleanup process of the FragmentExecutor. During cleanup the FragmentExecutor removes itself from the *runningFragments* list and also removes its *FragmentManager* (being an intermediate fragment). But it looks like the executor thread was stuck somewhere and cleanup was never completed: when cancellation was requested 8 hours later, the FragmentManager for this minor fragment was still found on that node, and the cancellation was then performed properly.
{code:java}
2018-08-08 08:51:23,207 [BitServer-1] INFO  o.a.d.e.w.fragment.FragmentExecutor - 249580bc-5bca-b166-e906-084b35ecf30d:2:3:
State change requested FAILED --> CANCELLATION_REQUESTED
2018-08-08 08:51:23,207 [BitServer-1] WARN  o.a.d.e.w.fragment.FragmentExecutor - 249580bc-5bca-b166-e906-084b35ecf30d:2:3:
Ignoring unexpected state transition FAILED --> CANCELLATION_REQUESTED
2018-08-08 08:51:23,209 [249580bc-5bca-b166-e906-084b35ecf30d:frag:2:3] INFO  o.a.d.e.w.fragment.FragmentExecutor
- 249580bc-5bca-b166-e906-084b35ecf30d:2:3: State change requested FAILED --> FINISHED

2018-08-08 08:51:23,216 [249580bc-5bca-b166-e906-084b35ecf30d:frag:2:3] ERROR o.a.d.e.w.fragment.FragmentExecutor
- SYSTEM ERROR: ConnectException: Connection refused

Fragment 2:3

[Error Id: 3f66fbfb-a93b-4b48-8008-ebb78e084905 on drillbit1:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ConnectException: Connection
refused

Fragment 2:3

[Error Id: 3f66fbfb-a93b-4b48-8008-ebb78e084905 on drillbit1:31010]
        at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633)
~[drill-common-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:300)
[drill-java-exec-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
[drill-java-exec-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:266)
[drill-java-exec-1.13.0.jar:1.13.0]
        at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
[drill-common-1.13.0.jar:1.13.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[na:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[na:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_144]
Caused by: org.apache.drill.exec.rpc.RpcException: Command failed while establishing connection.
 Failure type CONNECTION.
        at org.apache.drill.exec.rpc.RpcException.mapException(RpcException.java:67) ~[drill-rpc-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.rpc.ListeningCommand.connectionFailed(ListeningCommand.java:66)
~[drill-rpc-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.rpc.data.DataTunnel$SendBatchAsyncListen.connectionFailed(DataTunnel.java:145)
~[drill-java-exec-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.connectionFailed(ReconnectingConnection.java:152)
~[drill-rpc-1.13.0.jar:1.13.0]
        at org.apache.drill.exec.rpc.ConnectionMultiListener$ConnectionHandler.operationComplete(ConnectionMultiListener.java:119)
~[drill-rpc-1.13.0.jar:1.13.0]
        at 
Caused by: java.net.ConnectException: Connection refused
        ... 10 common frames omitted
{code}
Cleanup could be stuck for 8 hours (until cancellation was triggered) for the reason below.
 # This intermediate fragment was running an UnorderedReceiver and a PartitionSender. If there is logic in the PartitionSender such that in some case it keeps waiting for the next incoming batch without checking the fragment executor state (via shouldContinue()), then we can get stuck: only after checking shouldContinue() would it know that it has to exit and stop waiting for more batches. The PartitionSender code still needs to be looked into to confirm this; a sketch of a state-aware wait follows below.
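
As a sketch of what such a state-aware wait could look like (shouldContinue is modeled here as a plain Supplier; the real fix would live in Drill's receiver code): poll with a timeout and re-check the executor state, instead of an unbounded take().
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: a bounded wait that re-checks the executor state.
class StateAwareTakeSketch {
  static Object takeNext(LinkedBlockingQueue<Object> queue,
                         Supplier<Boolean> shouldContinue) throws InterruptedException {
    while (shouldContinue.get()) {
      Object batch = queue.poll(100, TimeUnit.MILLISECONDS); // bounded wait
      if (batch != null) {
        return batch;
      }
      // the loop re-checks shouldContinue(), so an asynchronously set FAILED
      // state can no longer strand this thread on the queue
    }
    return null; // executor is no longer running; the caller starts cleanup
  }
}
{code}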

I also looked into the profile, and based on my understanding major fragment 2's unordered receiver should get all the batches from major fragment 4's single sender. But when I compare the batches sent by the single sender with those received by the receiver, there is a mismatch: the sender has sent more batches than the receiver has received.

*Sender Stats:*
 Batches: 113,528
 Rows: 62,554,675

*Receiver Stats:*
 Batches: 111,372
 Rows: 61,137,838

That is a shortfall of 2,156 batches and 1,416,837 rows on the receiver side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
