spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Issues with partitionBy: FetchFailed
Date Tue, 23 Sep 2014 00:28:40 GMT
Hi Jerry,

The one executor hung with one CPU core running at 100% sounds exactly like
the symptoms I observed in https://issues.apache.org/jira/browse/SPARK-2546,
a deadlock around the JobConf.  The next time that happens, can you take a
jstack and compare it with the one on the ticket?  If it looks like the
trace below, I believe you've hit SPARK-2546:

    at java.util.HashMap.transfer(HashMap.java:601)
    at java.util.HashMap.resize(HashMap.java:581)
    at java.util.HashMap.addEntry(HashMap.java:879)
    at java.util.HashMap.put(HashMap.java:505)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
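
(To take one: jps on the worker machine will show the executor JVM's pid,
and jstack <pid> will dump its stacks; both are standard JDK tools, nothing
Spark-specific.)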


Cheers!
Andrew

On Mon, Sep 22, 2014 at 5:14 AM, Shao, Saisai <saisai.shao@intel.com> wrote:

>  I didn’t hit the “Too many open files” issue you did, because I set a
> relatively large open-file limit in Linux, like 640K. What I was seeing is
> that one executor pauses without doing anything, the resources are not
> fully used, and one CPU core runs at 100%, so I assume the process is in a
> full GC.
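>
> (If you are instead hitting the “Too many open files” variant, that is the
> per-process open-file limit on the worker machines, typically raised with
> ulimit -n or in /etc/security/limits.conf.)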
>
>
>
> The exception I met is FetchFailed; once I set a large value for
> “spark.core.connection.ack.wait.timeout”, the FetchFailed exception went
> away.
>
>
>
> Since the connection-manager-related code in the current master branch is
> being refactored, the behavior may differ from the previous code, and I
> suspect some bugs may have been introduced.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* David Rowe [mailto:davidrowe@gmail.com]
> *Sent:* Monday, September 22, 2014 7:12 PM
> *To:* Andrew Ash
> *Cc:* Shao, Saisai; Julien Carme; user@spark.apache.org
> *Subject:* Re: Issues with partitionBy: FetchFailed
>
>
>
> Yep, this is what I was seeing. I'll experiment tomorrow with a version
> prior to the changeset in that ticket.
>
>
>
> On Mon, Sep 22, 2014 at 8:29 PM, Andrew Ash <andrew@andrewash.com> wrote:
>
>  Hi David and Saisai,
>
>
>
> Are the exceptions you two are observing similar to the first one at
> https://issues.apache.org/jira/browse/SPARK-3633 ?  Copied below:
>
>
>
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com):
>   FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0),
>   shuffleId=3, mapId=75, reduceId=120)
>
>
>
> I'm seeing the same using Spark SQL on 1.1.0 -- I think there may have
> been a regression in 1.1, because the same SQL query works on the same
> cluster when back on 1.0.2.
>
>
>
> Thanks!
>
> Andrew
>
>
>
> On Sun, Sep 21, 2014 at 5:15 AM, David Rowe <davidrowe@gmail.com> wrote:
>
>  Hi,
>
>
>
> I've seen this problem before, and I'm not convinced it's GC.
>
>
>
> When Spark shuffles, it writes a lot of small files to store the data to
> be sent to other executors (AFAICT). From what I've read around the place,
> the intention is that these files sit in the OS's disk buffers, and since
> sync() is never called, they exist only in memory. The problem is that
> when you have a lot of shuffle data and the executors are configured to
> use, say, 90% of your memory, one of the two is going to hit the disk -
> either the JVM will be swapped out, or the files will be flushed out of
> the cache.
>
>
>
> So, when these nodes are timing out, it's not a GC problem - it's that the
> machine is actually thrashing.
>
>
>
> I've had some success with this problem by reducing the amount of memory
> that the executors are configured to use from, say, 90% to 60%. I don't
> know the internals of the code, but I'm sure this number is related to the
> fraction of your data that's going to be shuffled to other nodes. In any
> case, it's not something I can estimate very well in my own jobs - I
> usually have to find the right number by trial and error.
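>
> For concreteness, a sketch of the kind of change I mean (the numbers are
> illustrative for a 64GB worker, not a recommendation - the point is just
> to leave headroom for the page cache the shuffle files live in):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       // ~60% of a 64GB machine rather than ~90%
>       .set("spark.executor.memory", "38g")
>     val sc = new SparkContext(conf)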
>
>
>
> Perhaps somebody who knows the internals a bit better can shed some light.
>
>
>
> Cheers
>
>
>
> Dave
>
>
>
> On Sun, Sep 21, 2014 at 9:54 PM, Shao, Saisai <saisai.shao@intel.com>
> wrote:
>
>  Hi,
>
>
>
> I’ve also met this problem before. I think you can try setting
> “spark.core.connection.ack.wait.timeout” to a large value to avoid the ack
> timeout; the default is 60 seconds.
>
>
>
> Sometimes, because of a GC pause or some other reason, the acknowledgement
> message will time out, which leads to this exception, so you can try
> setting a larger value for this configuration.
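>
> For example, something like this (a minimal sketch, assuming you build the
> SparkContext yourself; the value is in seconds, and it could equally go in
> spark-defaults.conf or be passed with --conf at submission):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       // raise the ack timeout well above the 60-second default
>       .set("spark.core.connection.ack.wait.timeout", "600")
>     val sc = new SparkContext(conf)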
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Julien Carme [mailto:julien.carme@gmail.com]
> *Sent:* Sunday, September 21, 2014 7:43 PM
> *To:* user@spark.apache.org
> *Subject:* Issues with partitionBy: FetchFailed
>
>
>
> Hello,
>
> I am facing an issue with partitionBy; it is not clear whether it is a
> problem with my code or with my Spark setup. I am using Spark 1.1,
> standalone, and my other Spark projects work fine.
>
> So I have to repartition a relatively large file (about 70 million lines).
> Here is a minimal version of what is not working:
>
> import org.apache.spark.HashPartitioner
>
> // extractKey is my own helper that derives the partition key from a line
> val myRDD = sc.textFile("...").map { line => (extractKey(line), line) }
>
> val myRepartitionedRDD = myRDD.partitionBy(new HashPartitioner(100))
>
> myRepartitionedRDD.saveAsTextFile("...")
>
> It runs for quite some time, until I get some errors and it retries. The
> errors are:
>
> FetchFailed(BlockManagerId(3, myWorker2, 52082, 0),
> shuffleId=1, mapId=1, reduceId=5)
>
> The logs are not much more informative. I get:
>
> java.io.IOException: sendMessageReliably failed because ack was not
> received within 60 sec
>
>
>
> I get similar errors with all my workers.
>
> Do you have some kind of explanation for this behaviour? What could be
> wrong?
>
> Thanks,