spark-dev mailing list archives

From "Long, Andrew" <loand...@amazon.com.INVALID>
Subject Re: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB
Date Wed, 01 May 2019 19:33:17 GMT
It turned out that I was unintentionally shipping multiple copies of the Hadoop config to every
partition in an RDD. >.<  I was able to debug this by setting a breakpoint on the warning
message and inspecting the partition object itself.
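
For anyone hitting the same thing, here is a rough sketch of the usual workaround (spark-shell
style, so sc is the SparkContext; the class and variable names are just illustrative): broadcast
a serializable wrapper around the Hadoop Configuration once, instead of letting it get captured
into every partition.

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Hadoop's Configuration is not java.io.Serializable, so wrap it and
// round-trip it through its Writable interface instead.
class SerializableHadoopConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Broadcast the config once rather than letting it ride along inside every task.
val confBroadcast = sc.broadcast(new SerializableHadoopConf(sc.hadoopConfiguration))

sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  val hadoopConf = confBroadcast.value.value  // one deserialized copy per executor
  iter.map(i => i + ":" + hadoopConf.get("fs.defaultFS"))
}.take(3).foreach(println)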

Cheers Andrew

From: Russell Spitzer <russell.spitzer@gmail.com>
Date: Thursday, April 25, 2019 at 8:47 AM
To: "Long, Andrew" <loandrew@amazon.com.invalid>
Cc: dev <dev@spark.apache.org>
Subject: Re: FW: Stage 152 contains a task of very large size (12747 KB). The maximum recommended
task size is 100 KB

I usually only see that when folks parallelize very large objects. From what I
know, it's really just the data inside the "Partition" class of the RDD that is being sent
back and forth. So it's usually something like spark.parallelize(Seq(reallyBigMap)) or something
like that. The parallelize function jams all that data into the RDD's Partition metadata,
which can easily overwhelm the task size.
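
To make that concrete, a minimal sketch (spark-shell style, arbitrary sizes and names) that will
trigger the warning, since parallelize puts each slice of the collection inside the Partition
objects that get serialized with the tasks:

// A large driver-side collection; the size here is arbitrary.
val reallyBigMap = (1 to 1000000).map(i => i -> ("value-" + i)).toMap

// parallelize() slices the collection into the RDD's Partition objects, so each
// serialized task carries its slice of the data and TaskSetManager warns about
// "a task of very large size".
val rdd = sc.parallelize(reallyBigMap.toSeq, numSlices = 4)
rdd.count()

Broadcasting the big object (or loading the data on the executors instead of the driver) keeps
it out of the per-task payload.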

On Tue, Apr 23, 2019 at 3:57 PM Long, Andrew <loandrew@amazon.com.invalid> wrote:
Hey Friends,

Is there an easy way of figuring out what's being pulled into the task context?  I've been
getting the following message, which I suspect means I've unintentionally captured some large
objects, but figuring out what those objects are is stumping me.

19/04/23 13:52:13 WARN org.apache.spark.internal.Logging$class TaskSetManager: Stage 152 contains
a task of very large size (12747 KB). The maximum recommended task size is 100 KB

Cheers Andrew