spark-user mailing list archives

From java8964 <>
Subject RE: Expert advise needed. (POC is at crossroads)
Date Fri, 01 May 2015 00:03:43 GMT
I'm not really an expert here, but try the following ideas:
1) I assume you are using YARN; this blog is very good on resource tuning:
2) If 12G is a hard limit in this case, then you have no option but to lower your concurrency.
Try setting "--executor-cores=1" as a first step. This forces each executor to run
one task at a time. It is the least efficient setup for your job, but see whether your
application can finish without OOM.
3) Add more partitions to your RDD. For a given RDD, a larger number of partitions means each
partition will contain less data, which requires less memory to process, and if each partition
is processed by 1 core in each executor, you lower the executor's memory requirement to
nearly the lowest level.
4) Do you cache data? Don't cache it for now, and lower "",
so that less memory is reserved for the cache.
Since your top priority is to avoid OOM, all the above steps will make the job run slower,
or less efficiently. In any case, you should first check your code logic to see whether
anything can be improved, but we assume your code is already optimized, as you say in your
email. If the above steps still do not fix the OOM, then maybe the data for one partition
just cannot fit in a 12G heap, given the logic in your code.
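
To make the reasoning in 2) and 3) concrete, here is a back-of-the-envelope sketch in plain
Python (not Spark code). The function names, the even split of the heap between concurrent
tasks, and the 128 MiB target partition size are all assumptions for illustration; the real
per-task memory also depends on cache and shuffle reservations and on JVM overhead.

```python
def heap_per_task_gib(executor_memory_gib, executor_cores):
    """Rough upper bound on the heap each concurrently running task can use.

    Assumption: tasks running at the same time share the executor heap evenly;
    cache/shuffle reservations and JVM overhead are ignored.
    """
    return executor_memory_gib / executor_cores


def partitions_for(total_bytes, target_partition_bytes):
    """Smallest partition count whose average partition is at or below the target."""
    # Integer ceiling division, exact for large byte counts.
    return (total_bytes + target_partition_bytes - 1) // target_partition_bytes


# 12g executors with 4 cores (the posted config) versus 1 core (idea 2).
print(heap_per_task_gib(12, 4))
print(heap_per_task_gib(12, 1))

# Largest data set from the original post; the 128 MiB target is an assumption.
largest_dataset_bytes = 1_972_781_123_924
print(partitions_for(largest_dataset_bytes, 128 * 1024**2))
```

Dropping from 4 cores to 1 gives each task roughly four times the heap, and raising the
partition count shrinks the data each task has to hold at once.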
Date: Thu, 30 Apr 2015 18:48:12 +0530
Subject: Expert advise needed. (POC is at crossroads)

I am at a crossroads now, and expert advice will help me decide what the next course of the
project is going to be.
Background: At our company we process tons of data to help build an experimentation platform.
We fire more than 300 M/R jobs over petabytes of data; the run takes 24 hours and does lots
of joins. It's simply stupendously complex.
POC: Migrate a small portion of the processing to Spark and aim to achieve 10x gains. Today
this processing takes 2.5 to 3 hours in the M/R world.
Data Sources: 3 (all on HDFS). Format: two in Sequence File and one in Avro.
Data Size:
1)  64 files        169,380,175,136 bytes - Sequence
2) 101 files         84,957,259,664 bytes - Avro
3) 744 files      1,972,781,123,924 bytes-
Process:
A) Map-side join of #1 and #2
B) Left outer join of A) and #3
C) Reduce by key of B)
D) Map-only processing of C)
Optimizations:
1) Converted equi-join to map-side (broadcast variables) join, #A.
2) Converted groupBy + map => reduceByKey, #C.
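
For readers following the A) to D) steps, here is a toy sketch of the same flow in plain
Python. It is not the attached Spark code and uses no Spark API; the helper names, the keys,
and the values are hypothetical, and the sum in step C stands in for whatever the real
reduce function does.

```python
from collections import defaultdict
from operator import add

# Hypothetical toy stand-ins for the three data sets.
small = {"k1": "a", "k2": "b"}                   # side held in memory (broadcast)
large = [("k1", 1), ("k2", 2), ("k3", 3)]        # side that is streamed
events = [("k1", 10), ("k1", 20), ("k3", 30)]    # third data set


def map_side_join(records, lookup):
    """Inner equi-join by probing an in-memory dict, so `records` is never shuffled."""
    return [(k, (v, lookup[k])) for k, v in records if k in lookup]


def left_outer_join(left, right):
    """Keep every left record; pair it with each match, or None if there is none."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    out = []
    for k, v in left:
        matches = index.get(k)
        if matches:
            out.extend((k, (v, m)) for m in matches)
        else:
            out.append((k, (v, None)))
    return out


def reduce_by_key(pairs, combine):
    """Fold values per key as they arrive instead of grouping them first."""
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return acc


joined_a = map_side_join(large, small)                                    # step A
joined_b = left_outer_join(joined_a, events)                              # step B
totals = reduce_by_key([(k, m or 0) for k, (_, m) in joined_b], add)      # step C
report = [f"{k}={v}" for k, v in sorted(totals.items())]                  # step D
print(report)
```

The reduce step keeps only one running value per key, which is why it needs less memory
than collecting every value for a key before mapping over the group.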
I have a huge YARN (Hadoop 2.4.x) cluster at my disposal, but I am limited to using only 12G
on each node.
1) My POC (after a month of crazy research and lots of Q&A on this amazing forum) runs fine
with 1 file each from the above data sets and finishes in 10 mins using 4 executors. I started
at 60 mins and got it down to 10 mins.
2) With 5 files from each data set it takes 45 mins and 16 executors.
3) When I run against 10 files, it fails repeatedly with OOM and several timeout errors.
Configs: --num-executors 96 --driver-memory 12g --driver-java-options "-XX:MaxPermSize=10G"
--executor-memory 12g --executor-cores 4, Spark 1.3.1
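
To put these runs in proportion, here is a rough plain-Python estimate of the input volume
and the number of concurrent task slots. The file counts and byte totals are copied from the
Data Size list; the assumption that all files within a data set are about the same size, and
the function names, are mine.

```python
# (file count, total bytes) per data set, copied from the Data Size list.
DATA_SETS = {
    1: (64, 169_380_175_136),
    2: (101, 84_957_259_664),
    3: (744, 1_972_781_123_924),
}


def estimated_input_bytes(files_per_set):
    """Estimate bytes read when taking `files_per_set` files from each data set.

    Assumption: files within a data set are all roughly the same size.
    """
    return sum((total // count) * files_per_set for count, total in DATA_SETS.values())


def task_slots(num_executors, executor_cores):
    """Number of tasks that can run at the same time across all executors."""
    return num_executors * executor_cores


for files in (1, 5, 10):
    print(files, "file(s) per data set:", estimated_input_bytes(files), "bytes")

print("concurrent task slots:", task_slots(96, 4))
```

Under the equal-file-size assumption, one file from each data set comes to about 6 GB of
input, so the failing 10-file run reads roughly ten times that.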

Expert Advice
My goal is simple: to be able to complete the processing at 10x to 100x the speed of M/R,
or show it's not possible with Spark.
A) 10x to 100x
1) What will it take in terms of # of executors, # of executor-cores, amount of memory on
each executor, and any unknown magic settings that I am supposed to set, to reach this goal?
2) I am attaching the code for review; can the processing be sped up further, if that is
possible at all?
3) Do I need to do something else?
B) Give up and wait for the next amazing tech to come up
Given the steps that I have performed so far, should I conclude that it's not possible to
achieve 10x to 100x gains and that I am stuck with the M/R world for now?
I am in need of help here. I am available for discussion at any time (day/night).
Hope I provided all the details.
Regards,
