spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Problems saving a large RDD (1 TB) to S3 as a sequence file
Date Sat, 24 Jan 2015 07:23:46 GMT
Can you also try increasing the Akka frame size (spark.akka.frameSize, in MB)?

.set("spark.akka.frameSize","50") // Set it to a higher number


Thanks
Best Regards

On Sat, Jan 24, 2015 at 3:58 AM, Darin McBeath <ddmcbeath@yahoo.com.invalid>
wrote:

> Thanks for the ideas Sven.
>
> I'm using stand-alone cluster (Spark 1.2).
> FWIW, I was able to get this running (just now).  This is the first time
> it's worked in probably my last 10 attempts.
>
> In addition to limiting the executors to only 50% of the cluster, I
> added/changed the following in the settings below.  Maybe I just got lucky
> (although I think not).  It would be good if someone could weigh in and
> confirm that these changes are sensible.  I'm also hoping the support for
> placement groups (targeted for 1.3 in the ec2 scripts) will help the
> situation.  All in all, it takes about 45 minutes to write a 1 TB file back
> to S3 (as 1024 partitions).
>
>
> SparkConf conf = new SparkConf()
>     .setAppName("SparkSync Application")
>     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     .set("spark.rdd.compress", "true")
>     .set("spark.core.connection.ack.wait.timeout", "600")
>     .set("spark.akka.timeout", "600")    // Increased from 300
>     .set("spark.akka.threads", "16")     // Added so that default was increased from 4 to 16
>     .set("spark.task.maxFailures", "64") // Didn't really matter as I had no failures in this run
>     .set("spark.storage.blockManagerSlaveTimeoutMs", "300000");
>
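For a rough sense of scale, the figures quoted above (about 1 TB, 1024 partitions, 45 minutes), together with the 8-node cluster described in the original post, imply an aggregate write rate to S3 that can be worked out directly. The sketch below is only back-of-the-envelope arithmetic. It assumes binary units (1 TB = 1024 * 1024 MB) and an even spread across nodes, and the class and method names are made up for illustration.

```java
class WriteThroughput {
    // Aggregate throughput in MB/s for totalMb written in the given number of minutes.
    static double aggregateMbPerSec(double totalMb, double minutes) {
        return totalMb / (minutes * 60.0);
    }

    public static void main(String[] args) {
        double totalMb = 1024.0 * 1024.0; // 1 TB, assuming binary units
        double minutes = 45.0;            // wall-clock time reported in the thread
        int nodes = 8;                    // r3.8xlarge nodes from the original post
        int partitions = 1024;            // partitions written to S3

        double aggregate = aggregateMbPerSec(totalMb, minutes);
        System.out.printf("aggregate:      %.1f MB/s%n", aggregate);
        System.out.printf("per node:       %.1f MB/s%n", aggregate / nodes);
        System.out.printf("per partition:  %.0f MB%n", totalMb / partitions);
    }
}
```

Under those assumptions this comes to roughly 388 MB/s in aggregate, or a little under 49 MB/s per node.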
>
> ________________________________
> From: Sven Krasser <krasser@gmail.com>
> To: Darin McBeath <ddmcbeath@yahoo.com>
> Cc: User <user@spark.apache.org>
> Sent: Friday, January 23, 2015 5:12 PM
> Subject: Re: Problems saving a large RDD (1 TB) to S3 as a sequence file
>
>
>
> Hey Darin,
>
> Are you running this over EMR or as a standalone cluster? I've had
> occasional success in similar cases by digging through all executor logs
> and trying to find exceptions that are not caused by the application
> shutdown (but the logs remain my main pain point with Spark).
>
> That aside, another explanation could be S3 throttling you due to volume
> (and hence causing write requests to fail). You can try to split your file
> into multiple pieces and store those as S3 objects with different prefixes
> to make sure they end up in different partitions in S3. See here for
> details:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html.
> If that works, that'll narrow the cause down.
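One way to read the suggestion above is to put a short hash in front of each object key so that keys spread across different S3 prefixes. The sketch below shows only the key naming. The prefixedKey helper, the 4-character prefix length, and the output/part-NNNNN names are assumptions for illustration, and hooking this into how Spark names its output files is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class S3KeyPrefix {
    // Returns the key with a short hex prefix taken from the MD5 hash of the key.
    // prefixLen must be between 0 and 32 (the length of an MD5 hex string).
    static String prefixedKey(String key, int prefixLen) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(0, prefixLen) + "-" + key;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical object names; each one gets its own hash-derived prefix.
        for (int i = 0; i < 4; i++) {
            String key = String.format("output/part-%05d", i);
            System.out.println(prefixedKey(key, 4));
        }
    }
}
```

The hash is used here only to scatter key names, not for security, so MD5 is adequate for the purpose.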
>
> Best,
> -Sven
>
> On Fri, Jan 23, 2015 at 12:04 PM, Darin McBeath
> <ddmcbeath@yahoo.com.invalid> wrote:
>
> >I've tried various ideas, but I'm really just shooting in the dark.
> >
> >I have an 8-node cluster of r3.8xlarge machines. The RDD (with 1024
> >partitions) I'm trying to save off to S3 is approximately 1 TB in size (with
> >the partitions pretty evenly distributed in size).
> >
> >I just tried a test to dial back the number of executors on my cluster
> >from using the entire cluster (256 cores) down to 128.  Things seemed to
> >get a bit farther (maybe) before the wheels started coming off again.
> >But the job always fails when all I'm trying to do is save the 1 TB file to
> >S3.
> >
> >I see the following in my master log file.
> >
> >15/01/23 19:01:54 WARN master.Master: Removing worker-20150123172316 because we got no heartbeat in 60 seconds
> >15/01/23 19:01:54 INFO master.Master: Removing worker worker-20150123172316 on
> >15/01/23 19:01:54 INFO master.Master: Telling app of lost executor: 3
> >
> >For the stage that eventually fails, I see the following summary
> >information.
> >
> >Summary Metrics for 729 Completed Tasks
> >Metric                  Min        25th pct   Median     75th pct   Max
> >Duration                2.5 min    4.8 min    5.5 min    6.3 min    9.2 min
> >GC Time                 0 ms       0.3 s      0.4 s      0.5 s      5 s
> >Shuffle Read (Remote)   309.3 MB   321.7 MB   325.4 MB   329.6 MB   350.6 MB
> >
> >So, the max GC time was only 5 s across 729 completed tasks, which sounds
> >reasonable.  People often point to GC as the reason executors are lost, but
> >that does not appear to be the case here.
> >
> >Here is a typical snapshot of some completed tasks.  They tend to complete
> >in approximately 6 minutes, so it takes about 6 minutes to write one
> >partition to S3 (a partition being roughly 1 GB).
> >
> >65      23619   0       SUCCESS         ANY     5 /  2015/01/23 18:30:32   5.8 min   0.9 s   344.6 MB
> >59      23613   0       SUCCESS         ANY     7 /  2015/01/23 18:30:32   6.0 min   0.4 s   324.1 MB
> >68      23622   0       SUCCESS         ANY     1 /  2015/01/23 18:30:32   5.7 min   0.5 s   329.9 MB
> >62      23616   0       SUCCESS         ANY     6 /  2015/01/23 18:30:32   5.8 min   0.7 s   326.4 MB
> >61      23615   0       SUCCESS         ANY     3 /  2015/01/23 18:30:32   5.5 min   1 s     335.7 MB
> >64      23618   0       SUCCESS         ANY     2 /  2015/01/23 18:30:32   5.6 min   2 s     328.1 MB
> >
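The per-task duration above can be turned into a rough estimate of total job time by counting scheduling "waves": how many rounds of tasks are needed when only a fixed number can run at once. The sketch below assumes one task per core and that each task takes about 6 minutes regardless of how many run concurrently, which is a simplification; the class and method names are made up for illustration.

```java
class WaveEstimate {
    // Number of scheduling waves needed to run `tasks` tasks on `slots` concurrent slots.
    static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        int tasks = 1024;            // partitions in the RDD
        double minutesPerTask = 6.0; // typical task duration from the snapshot above
        for (int slots : new int[] {256, 128}) {
            int w = waves(tasks, slots);
            System.out.printf("%d slots: %d waves, ~%.0f min%n", slots, w, w * minutesPerTask);
        }
    }
}
```

With 128 slots this gives 8 waves and about 48 minutes, which is in the same range as the roughly 45 minutes reported for the run that eventually succeeded.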
> >Then towards the end, when things start heading south, I see the
> >following.  These tasks never complete but you can see that they have taken
> >more than 47 minutes (so far) before the job finally fails.  Not really
> >sure why.
> >
> >671     24225   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >672     24226   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >673     24227   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >674     24228   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >675     24229   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >676     24230   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >677     24231   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >678     24232   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >679     24233   0       RUNNING         ANY     1 /  2015/01/23 18:59:14   47 min
> >680     24234   0       RUNNING         ANY     1 /  2015/01/23 18:59:17   47 min
> >681     24235   0       RUNNING         ANY     1 /  2015/01/23 18:59:18   47 min
> >682     24236   0       RUNNING         ANY     1 /  2015/01/23 18:59:18   47 min
> >683     24237   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >684     24238   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >685     24239   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >686     24240   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >687     24241   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >688     24242   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >689     24243   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >690     24244   0       RUNNING         ANY     5 /  2015/01/23 18:59:20   47 min
> >691     24245   0       RUNNING         ANY     5 /  2015/01/23 18:59:21   47 min
> >
> >What's odd is that even on the same machine (see below) some tasks are
> >still completing (in less than 5 minutes) while other tasks on the same
> >machine seem to be hung after 46 minutes.  Keep in mind all I'm doing is
> >saving the file to S3, so one would think the amount of work per
> >task/partition would be fairly equal.
> >
> >694     24248   0       SUCCESS         ANY     0 /  2015/01/23 18:59:32   4.5 min   0.3 s   326.5 MB
> >695     24249   0       SUCCESS         ANY     0 /  2015/01/23 18:59:32   4.5 min   0.3 s   330.8 MB
> >696     24250   0       RUNNING         ANY     0 /  2015/01/23 18:59:32   46 min
> >697     24251   0       RUNNING         ANY     0 /  2015/01/23 18:59:32   46 min
> >698     24252   0       SUCCESS         ANY     0 /  2015/01/23 18:59:32   4.5 min   0.3 s   325.8 MB
> >699     24253   0       SUCCESS         ANY     0 /  2015/01/23 18:59:32   4.5 min   0.3 s   325.2 MB
> >700     24254   0       SUCCESS         ANY     0 /  2015/01/23 18:59:32   4.5 min   0.3 s   323.4 MB
> >
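A simple way to put a number on "hung" is to compare each running task's elapsed time with the median duration of the completed tasks. The sketch below does that with the values from the listing above. The 1.5 multiplier and all names are assumptions for illustration. Spark's own speculation settings (spark.speculation, spark.speculation.multiplier) are based on a similar comparison against the median, but this is not Spark's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StragglerCheck {
    // Median of a non-empty array; the input is left unmodified.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Returns indices into `running` whose elapsed time exceeds multiplier * median(completed).
    static List<Integer> stragglers(double[] completed, double[] running, double multiplier) {
        double threshold = multiplier * median(completed);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < running.length; i++) {
            if (running[i] > threshold) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] completed = {4.5, 4.5, 4.5, 4.5, 4.5}; // minutes, tasks 694, 695, 698-700
        double[] running = {46, 46};                    // minutes so far, tasks 696 and 697
        System.out.println(stragglers(completed, running, 1.5)); // prints [0, 1]
    }
}
```

Both running tasks are flagged here, since 46 minutes is far beyond 1.5 times the 4.5-minute median.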
> >If anyone has some suggestions, please let me know.  I've tried playing
> >around with various configuration options, but I've found nothing yet that
> >will fix the underlying issue.
> >
> >Thanks.
> >
> >Darin.
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >For additional commands, e-mail: user-help@spark.apache.org
> >
> >
>
>
> --
>
> http://sites.google.com/site/krasser/?utm_source=sig
