spark-user mailing list archives

From Sandy Ryza <>
Subject Re: Many retries for Python job
Date Fri, 21 Nov 2014 18:41:28 GMT
Hi Brett,

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?
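For reference, when YARN kills a container for exceeding its memory limit, the NodeManager log contains a line of the form "is running beyond physical memory limits ... killing container". A minimal sketch of scanning for such lines (the log excerpt, path comment, and exact numbers are illustrative, not from this cluster):

```python
import re

# Illustrative NodeManager log excerpt; real logs live under the
# NodeManager log directory (e.g. /var/log/hadoop-yarn/ on EMR -- the
# exact path varies by distribution).
log_text = """\
2014-11-21 17:02:11,401 WARN monitor.ContainersMonitorImpl: \
Container [pid=12345,containerID=container_1416_0001_01_000042] \
is running beyond physical memory limits. Current usage: 5.1 GB of \
5 GB physical memory used; killing container.
"""

# Match the kill message and pull out the container ID and usage figures.
kill_pattern = re.compile(
    r"Container \[pid=\d+,containerID=(container_\S+?)\] "
    r"is running beyond physical memory limits\. "
    r"Current usage: ([\d.]+ GB) of ([\d.]+ GB)"
)

for match in kill_pattern.finditer(log_text):
    container_id, used, limit = match.groups()
    print(f"{container_id}: used {used} of {limit}")
```

If lines like this show up around the time the FetchFailures start, the executors are being killed by YARN rather than failing on their own.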

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer <> wrote:

> I’m running a Python script with spark-submit on top of YARN on an EMR
> cluster with 30 nodes.  The script reads in approximately 3.9 TB of data
> from S3, and then does some transformations and filtering, followed by some
> aggregate counts.  During Stage 2 of the job, everything looks to complete
> just fine with no executor failures or resubmissions, but when Stage 3
> starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors.
> Actually, I usually see at least 3-4 retries on Stage 2 before Stage 3 can
> successfully start.  The whole application eventually completes, but all
> of the retries add roughly an hour of overhead.
> I’m trying to determine why there were FetchFailure exceptions, since
> anything computed in the job that could not fit in the available memory
> cache should be by default spilled to disk for further retrieval.  I also
> see some "Connection refused" and
> "sendMessageReliably failed without being ACK'd"
> errors in the logs after a CancelledKeyException followed by
> a ClosedChannelException, but I have no idea why the nodes in the EMR
> cluster would suddenly stop being able to communicate.
> If anyone has ideas as to why these tasks need to be rerun several times
> in this job, please let me know, as I am fairly bewildered by this
> behavior.
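A common cause of exactly this symptom pattern for a PySpark job on YARN (FetchFailures plus dropped connections between nodes) is YARN killing executors whose total process memory, including the off-heap Python worker processes, exceeds the container limit. If the NodeManager logs confirm that, raising the YARN memory overhead is the usual mitigation; a hedged sketch, where the overhead value and script name are illustrative and must be tuned to the actual executor size:

```shell
# Reserve extra container headroom beyond the JVM heap for off-heap
# memory (Python worker processes, shuffle buffers). 1024 MB is an
# illustrative value; the property name is the Spark-on-YARN setting
# spark.yarn.executor.memoryOverhead (Spark 1.1+).
spark-submit \
  --master yarn \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your_script.py
```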
