Thanks for sharing the thread dumps. I had a look at them and couldn't find anything unusual. Is there anything in the logs (driver + executor) that suggests what's going on? Also, what does the Spark job do, and which versions of Spark and Hadoop are you using?

Thanks,
Aniket

On Wed, Nov 16, 2016 at 2:07 AM Michael Johnson <mjjohnson.geo@yahoo.com> wrote:
The extremely long hang/pause has started happening again. I've been running on a small remote cluster, so I used the UI to grab thread dumps rather than doing it from the command line. There seems to be one executor still alive, along with the driver; I grabbed 4 thread dumps from each, a couple of seconds apart. I'd greatly appreciate any help tracking down what's going on! (I've attached them, but I can paste them somewhere if that's more convenient.)

Thanks,
Michael




On Sunday, November 6, 2016 10:49 PM, Michael Johnson <mjjohnson.geo@yahoo.com.INVALID> wrote:


Hm. Something must have changed, as it was happening quite consistently and now I can't get it to reproduce. Thank you for the offer, and if it happens again I will try grabbing thread dumps and I will see if I can figure out what is going on.


On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar <aniket.bhatnagar@gmail.com> wrote:


I doubt it's GC, since you mentioned that the pause lasts several minutes. Since it's reproducible in local mode, can you run the Spark application locally and, once your job is complete (and the application appears paused), take 5 thread dumps (using jstack or jcmd on the local Spark JVM process) with a 1-second delay between each dump and attach them? I can take a look.
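
In case it helps, here is a minimal sketch of one way to automate the capture described above. It assumes Python 3.7+ and that `jstack` from the JDK is on the PATH; the function name, file naming scheme, and the injectable `run` parameter are illustrative choices, not anything Spark provides.

```python
import subprocess
import time
from pathlib import Path


def capture_thread_dumps(pid, count=5, delay_seconds=1.0, out_dir=".", run=subprocess.run):
    """Run `jstack <pid>` `count` times, pausing `delay_seconds` between runs.

    Each dump is written to its own file and the list of file paths is returned.
    `run` is injectable (hypothetical convenience) so this can be tried without a JVM.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        # check=True raises if jstack fails, e.g. when the PID does not exist.
        result = run(["jstack", str(pid)], capture_output=True, text=True, check=True)
        target = out_path / f"threaddump-{pid}-{i + 1}.txt"
        target.write_text(result.stdout)
        written.append(target)
        if i < count - 1:
            time.sleep(delay_seconds)  # no need to wait after the final dump
    return written
```

Something like `capture_thread_dumps(12345)` would then write five files into the current directory, where 12345 stands in for the PID of the paused Spark JVM (for example as listed by `jps`).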

Thanks,
Aniket

On Sun, Nov 6, 2016 at 2:21 PM Michael Johnson <mjjohnson.geo@yahoo.com> wrote:
Thanks; I tried looking at the thread dumps for the driver and the one executor that had that option in the UI, but I'm afraid I don't know how to interpret what I saw... I don't think it could be my code directly, since at this point my code has all completed. Could GC be taking that long?

(I could also try grabbing the thread dumps and pasting them here, if that would help?)

On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar <aniket.bhatnagar@gmail.com> wrote:


To find out what's going on, you can study the thread dumps, either in the Spark UI or with any thread dump analysis tool.
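
As a rough illustration of what studying a dump can look like: a JVM normally exits only after all of its non-daemon threads have finished, so one thing sometimes worth checking in a process that will not exit is which non-daemon threads are still around and what state they are in. The sketch below assumes the plain-text layout that HotSpot `jstack` prints on JDK 8 and later (older JDKs and the Spark UI format the information differently), and the function name is made up for this example.

```python
import re

# Header line of a Java thread in jstack output (JDK 8+ layout assumed), e.g.
# "main" #1 prio=5 os_prio=0 tid=0x... nid=0x... waiting on condition [0x...]
HEADER = re.compile(r'^"(?P<name>.*)" #\d+ (?P<daemon>daemon )?prio=')
STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>\w+)')


def non_daemon_threads(dump_text):
    """Return (name, state) for every non-daemon Java thread in a jstack dump."""
    found = []
    current = None  # name of the non-daemon thread whose state line is still expected
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # Internal VM threads have no "#<n>" field and are skipped here.
            header = HEADER.match(line)
            if header and not header.group("daemon"):
                current = header.group("name")
                found.append([current, "UNKNOWN"])
            else:
                current = None
            continue
        state = STATE.match(line)
        if state and current is not None:
            found[-1][1] = state.group("state")
            current = None
    return [tuple(item) for item in found]
```

Running this over several dumps taken a second apart and comparing the results gives a quick view of which non-daemon threads persist; the full stack traces in the dumps are still needed to see what those threads are actually doing.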

Thanks,
Aniket

On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson <mjjohnson.geo@yahoo.com.invalid> wrote:
I'm doing some processing and then clustering of a small dataset (~150 MB). Everything seems to work fine, until the end; the last few lines of my program are log statements, but after printing those, nothing seems to happen for a long time...many minutes; I'm not usually patient enough to let it go, but I think one time when I did just wait, it took over an hour (and did eventually exit on its own). Any ideas on what's happening, or how to troubleshoot?

(This happens both when running locally, using the localhost mode, as well as on a small cluster with four 4-processor nodes each with 15GB of RAM; in both cases the executors have 2GB+ of RAM, and none of the inputs/outputs on any of the stages is more than 75 MB...)

Thanks,
Michael