drill-user mailing list archives

From Kunal Khatua <kkha...@mapr.com>
Subject RE: Queries getting stuck in RUNNING state occasionally
Date Thu, 25 Jan 2018 18:37:07 GMT
Hi Lalit
Your profile suggests that the query is stuck in major fragment 06-xx-xx, which is fed data
from 16-xx-xx via 11-Exchange.

Judging by the operator overview, and comparing with the otherwise similar major fragments,
only this one seems to be stuck completing its sort.

Could you provide a jstack thread dump from any of the nodes hosting fragments of 06-xx-xx?

Thanks
Kunal

From: Lalit Mishra [mailto:lalit.mishra@mojonetworks.com]
Sent: Thursday, January 25, 2018 4:03 AM
To: user@drill.apache.org
Subject: Re: Queries getting stuck in RUNNING state occasionally

Hello Timothy,

Please find attached the profile file (it exceeded the message size limit, so I had to gzip it).
Please excuse the length of the query; it is a long query unioned 5 times. I have tried to
reproduce the problem with a smaller query, but have failed so far.

Yes, we are using MapR 6.0.

Thanks,
Lalit Mishra

On Thu, Jan 25, 2018 at 2:37 AM, Timothy Farkas <timothyfarkas@apache.org> wrote:


On 2018/01/23 14:04:42, Lalit Mishra <lalit.mishra@mojonetworks.com> wrote:
> Hello,
>
> We are using Drill 1.11 (under YARN) on a 3-node cluster.
> Occasionally a query remains stuck in the RUNNING state. The same
> query runs successfully on other occasions. I did not capture any
> information the previous times this occurred, but have collected the
> following on the latest occurrence:
>
>    - Full json profile
>    - Thread dumps on all three nodes
>
> I can provide these if needed.
>
> In the thread dumps there are 107 threads tagged with the query id.
> 105 of them are stuck with the following stack trace:
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
>     - waiting on <0x4a20ff6e> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x4a20ff6e> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at
> java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at
> java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at
> org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at
> org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at
> org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141)
>     at
> org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:406)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:357)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:302)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@45083904
>
>
> The other 2 are stuck with:
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
>     - waiting on <0x730eeaf1> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x730eeaf1> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at
> java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at
> java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at
> org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at
> org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at
> org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.getNext(MergingRecordBatch.java:147)
>     at
> org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.innerNext(MergingRecordBatch.java:241)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@378527f8
>
>
> Any help in figuring out what is going wrong will be
> appreciated. Thanks in advance!
>
> Thanks,
> Lalit Mishra
>
Hi Lalit,

The stack traces you provided indicate that downstream operators are waiting for data to
be sent by upstream operators, which are blocked. This could mean that a scan operator is
blocked reading from a data source, or that an operator like Sort or HashAgg is getting
stuck. Can you please provide the query you are using, along with the JSON profile?
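
One way to check this against a thread dump is to group the query's fragment threads by
major fragment and by the innermost Drill operator frame on each stack. Below is a minimal
sketch in Python. It assumes the dump has been saved as plain text in the format quoted
above, with each "at" frame unwrapped onto a single line. The function names are
illustrative and are not part of Drill.

```python
import re
from collections import Counter

# Header of a fragment thread as it appears in the quoted dump, e.g.
# "2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING"
HEADER = re.compile(
    r"^(?P<query>[0-9a-f-]+):frag:(?P<major>\d+):(?P<minor>\d+)"
    r"\s+id=\d+\s+state=(?P<state>\w+)"
)
# A stack frame line; capture the qualified method name up to the "(".
FRAME = re.compile(r"^\s+at\s+(?P<frame>[^\s(]+)")
# Assumption: operator implementations live under this package, as in the traces above.
OPERATOR_PREFIX = "org.apache.drill.exec.physical.impl."


def parse_fragment_threads(dump_text):
    """Return one dict per fragment thread: query, major, minor, state, frames."""
    threads = []
    current = None
    for line in dump_text.splitlines():
        if line and not line[0].isspace():
            # Any unindented line starts a new thread; keep only fragment threads.
            header = HEADER.match(line)
            current = None
            if header:
                current = {
                    "query": header["query"],
                    "major": int(header["major"]),
                    "minor": int(header["minor"]),
                    "state": header["state"],
                    "frames": [],
                }
                threads.append(current)
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            current["frames"].append(frame["frame"])
    return threads


def blocking_operator(frames):
    """Name the innermost Drill operator frame as 'Class.method', or None."""
    for frame in frames:  # frames are listed innermost first
        if frame.startswith(OPERATOR_PREFIX):
            return ".".join(frame.rsplit(".", 2)[-2:])
    return None


def summarize(dump_text, query_id):
    """Count a query's threads by (major fragment, state, blocking operator)."""
    counts = Counter()
    for thread in parse_fragment_threads(dump_text):
        if thread["query"] == query_id:
            key = (thread["major"], thread["state"],
                   blocking_operator(thread["frames"]))
            counts[key] += 1
    return counts
```

Calling summarize(text, query_id) on the contents of a saved dump returns a Counter keyed by
(major fragment, thread state, operator frame). Threads parked in a receiver frame are only
waiting for input, so a fragment whose threads are blocked somewhere else is the more likely
place to look.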

Also, please note that Apache Drill does not have YARN support yet; the PR is pending here:
https://github.com/apache/drill/pull/1011
So are you using MapR's proprietary distribution of Drill?

Thanks,
Tim
