This is what the web UI looks like:

I also tail all the worker logs, and these are the last entries before the waiting begins:

14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, minRequest: 10066329
14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 29853 non-zero-bytes blocks out of 37714 blocks
14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 5 remote gets in 62 ms

[PSYoungGen: 12464967K->3767331K(10552192K)] 36074093K->29053085K(44805696K), 0.6765460 secs] [Times: user=5.35 sys=0.02, real=0.67 secs]
[PSYoungGen: 10779466K->3203826K(9806400K)] 35384386K->31562169K(44059904K), 0.6925730 secs] [Times: user=5.47 sys=0.00, real=0.70 secs]

From the screenshot above you can see that tasks take ~6 minutes to complete. The time the tasks take seems to depend on the amount of input data: if the S3 input pattern matches 2.5 times less data (less data to shuffle write and later read), the same tasks take 1 minute. Any idea how to debug what the workers are doing?

Domen
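A quick way to see what the worker JVMs are doing during such a stall is to take repeated thread dumps, either with jstack <pid> against each worker process or from inside the JVM. A minimal in-process sketch using only the standard JDK call (nothing Spark-specific; the object name is illustrative):

    import scala.collection.JavaConverters._

    // Print every live thread, its state, and the top of its stack.
    // Taking a few snapshots during the stall shows which threads are
    // RUNNABLE and which are BLOCKED or WAITING, and on what.
    object StackSnapshot {
      def main(args: Array[String]) {
        Thread.getAllStackTraces.asScala.foreach { case (thread, frames) =>
          println(thread.getName + ": " + thread.getState)
          frames.take(5).foreach(frame => println("  at " + frame))
        }
      }
    }

Comparing two or three dumps taken a minute apart makes it easier to tell a thread that is genuinely stuck from one that is slowly making progress.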
On Wed, Mar 19, 2014 at 5:27 PM, Mayur Rustagi [via Apache Spark User List] <[hidden email]> wrote:

You could have some outlier task that is preventing the next set of stages from launching. Can you check the state of the stages in the Spark web UI: is any task still running, or is everything halted?

Regards,
Mayur
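One setting relevant to outlier tasks is Spark's speculative execution, which re-launches unusually slow tasks on other nodes. A minimal sketch, assuming configuration through Java system properties as in Spark 0.8; the multiplier value shown is illustrative:

    // Must be set before the SparkContext is created.
    System.setProperty("spark.speculation", "true")
    // Consider a task for re-launch when it runs this many times
    // slower than the median task in its stage.
    System.setProperty("spark.speculation.multiplier", "1.5")

If the stall really is one straggler, speculation can mask it, but the thread dumps are still the way to find out why the straggler is slow.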
On Wed, Mar 19, 2014 at 5:40 AM, Domen Grabec <[hidden email]> wrote:

Hi,

I have a cluster with 16 nodes; each node has 69 GB of RAM (50 GB goes to Spark) and 8 cores, running Spark 0.8.1. I have a groupByKey operation that causes a wide RDD dependency, so shuffle write and shuffle read are performed.

For some reason all worker threads seem to sleep for about 3-4 minutes each time they perform a shuffle read and complete a set of tasks. See the graphs below for how no resources are being utilized in specific time windows. Each time 3-4 minutes pass, the next set of tasks is grabbed and processed, and then another waiting period follows. Each task has an input of 80 MB ± 5 MB of data to shuffle read.

Here is a link to a thread dump taken in the middle of the waiting period. Any idea what could cause the long waits?

Kind regards,
Domen
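To make the shuffle boundary concrete, here is a minimal sketch of the shape of job described above, written against the Spark 0.8 API; the S3 path and the key extraction are hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD operations such as groupByKey

    object ShuffleSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://master:7077", "ShuffleSketch")
        // Map stage: each task partitions its output by key (shuffle write).
        val pairs = sc.textFile("s3n://some-bucket/input/*")
                      .map(line => (line.split("\t")(0), line))
        // Wide dependency: every reduce task fetches blocks from
        // every map task (shuffle read), as in the fetch logs above.
        val grouped = pairs.groupByKey()
        println(grouped.count())
        sc.stop()
      }
    }

The waiting periods described sit exactly at that groupByKey boundary, where every reducer must fetch its 80 MB of blocks from all the mappers before work can resume.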