spark-user mailing list archives

From Darin McBeath <ddmcbe...@yahoo.com.INVALID>
Subject Problems saving a large RDD (1 TB) to S3 as a sequence file
Date Fri, 23 Jan 2015 20:04:14 GMT
I've tried various ideas, but I'm really just shooting in the dark.

I have an 8 node cluster of r3.8xlarge machines. The RDD I'm trying to save to S3 has 1024
partitions and is approximately 1 TB in size, with the partitions fairly evenly sized.
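
For reference, the save itself is nothing exotic; it's essentially the stock saveAsSequenceFile
call. A minimal sketch of the shape of the job (the bucket, path, app name, and tiny stand-in
RDD below are placeholders, not my real data; it also assumes S3 credentials are already
configured for the s3n filesystem):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // implicits for pair-RDD / sequence-file functions

object SaveToS3 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-1tb-to-s3"))
    // Stand-in for the real ~1 TB pair RDD; the 1024-partition count is the part that matters.
    val rdd = sc.parallelize(1 to 1024, numSlices = 1024).map(i => (i.toString, "payload"))
    // Keys and values must be convertible to Hadoop Writables (String is).
    rdd.saveAsSequenceFile("s3n://some-bucket/some-output")
    sc.stop()
  }
}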

I just tried a test that dialed back the number of executor cores from the entire cluster
(256 cores) down to 128.  Things seemed to get a bit farther (maybe) before the wheels came
off again.  But the job always fails, and all I'm trying to do is save the 1 TB file to S3.
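
In case the mechanism matters: by "dialing back" I mean the standalone total-core cap, i.e.
something like the snippet below (128 being the value from the test above; passing
--total-executor-cores 128 to spark-submit is the equivalent flag):

import org.apache.spark.SparkConf

// Cap this app at 128 of the cluster's 256 cores (the standalone total-core limit).
val conf = new SparkConf()
  .setAppName("save-1tb-to-s3")
  .set("spark.cores.max", "128")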

I see the following in my master log file.

15/01/23 19:01:54 WARN master.Master: Removing worker-20150123172316 because we got no heartbeat
in 60 seconds
15/01/23 19:01:54 INFO master.Master: Removing worker worker-20150123172316 on 
15/01/23 19:01:54 INFO master.Master: Telling app of lost executor: 3
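
For what it's worth, the 60 seconds in that first message is the standalone master's
spark.worker.timeout, which defaults to 60, so the workers really did go a full minute without
heart-beating; the threshold isn't unusually tight. Raising it (e.g. in conf/spark-env.sh on
the master) would only confirm that, not fix anything:

# Diagnostic only: give workers 2 minutes before the master declares them lost.
export SPARK_MASTER_OPTS="-Dspark.worker.timeout=120"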

For the stage that eventually fails, I see the following summary information.

Summary Metrics for 729 Completed Tasks

                         Min       25th %ile  Median    75th %ile  Max
Duration                 2.5 min   4.8 min    5.5 min   6.3 min    9.2 min
GC Time                  0 ms      0.3 s      0.4 s     0.5 s      5 s
Shuffle Read (Remote)    309.3 MB  321.7 MB   325.4 MB  329.6 MB   350.6 MB

So the max GC time was only 5 s across 729 completed tasks, which seems reasonable. Since GC
pressure is the reason people usually cite for lost executors, that does not appear to be my
problem.

Here is a typical snapshot of some completed tasks. You can see that they tend to complete in
approximately 6 minutes, so it takes about 6 minutes to write one partition (roughly 1 GB)
to S3.
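
Back of the envelope: ~1 GB per partition over ~6 minutes is roughly 1024 MB / 360 s ≈ 2.8 MB/s
per task, so the successful writes look slow but steady rather than stalled.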

Index  ID     Attempt  Status   Locality  Executor/Host  Launch Time          Duration  GC Time  Shuffle Read
65     23619  0        SUCCESS  ANY       5 /            2015/01/23 18:30:32  5.8 min   0.9 s    344.6 MB
59     23613  0        SUCCESS  ANY       7 /            2015/01/23 18:30:32  6.0 min   0.4 s    324.1 MB
68     23622  0        SUCCESS  ANY       1 /            2015/01/23 18:30:32  5.7 min   0.5 s    329.9 MB
62     23616  0        SUCCESS  ANY       6 /            2015/01/23 18:30:32  5.8 min   0.7 s    326.4 MB
61     23615  0        SUCCESS  ANY       3 /            2015/01/23 18:30:32  5.5 min   1 s      335.7 MB
64     23618  0        SUCCESS  ANY       2 /            2015/01/23 18:30:32  5.6 min   2 s      328.1 MB

Then towards the end, when things start heading south, I see the following. These tasks never
complete; they had already been running for more than 47 minutes when the job finally failed.
Not really sure why.

Index  ID     Attempt  Status   Locality  Executor/Host  Launch Time          Duration
671    24225  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
672    24226  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
673    24227  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
674    24228  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
675    24229  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
676    24230  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
677    24231  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
678    24232  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
679    24233  0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
680    24234  0        RUNNING  ANY       1 /            2015/01/23 18:59:17  47 min
681    24235  0        RUNNING  ANY       1 /            2015/01/23 18:59:18  47 min
682    24236  0        RUNNING  ANY       1 /            2015/01/23 18:59:18  47 min
683    24237  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
684    24238  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
685    24239  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
686    24240  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
687    24241  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
688    24242  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
689    24243  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
690    24244  0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
691    24245  0        RUNNING  ANY       5 /            2015/01/23 18:59:21  47 min

What's odd is that even on the same machine (see below) some tasks are still completing in
less than 5 minutes while others seem to be hung after 46 minutes. Keep in mind all I'm doing
is saving the file to S3, so one would think the amount of work per task/partition would be
fairly equal.

Index  ID     Attempt  Status   Locality  Executor/Host  Launch Time          Duration  GC Time  Shuffle Read
694    24248  0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    326.5 MB
695    24249  0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    330.8 MB
696    24250  0        RUNNING  ANY       0 /            2015/01/23 18:59:32  46 min
697    24251  0        RUNNING  ANY       0 /            2015/01/23 18:59:32  46 min
698    24252  0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    325.8 MB
699    24253  0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    325.2 MB
700    24254  0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    323.4 MB

If anyone has any suggestions, please let me know.  I've tried playing around with various
configuration options, but I've found nothing yet that fixes the underlying issue.

Thanks.

Darin.

