spark-user mailing list archives

From Matei Zaharia <>
Subject Re: major Spark performance problem
Date Sun, 09 Mar 2014 20:41:19 GMT
Hi Dana,

It’s hard to tell exactly what is consuming time, but I’d suggest starting by profiling
the single application first. Three things to look at there:

1) How many stages and how many tasks per stage is Spark launching (see the application web
UI at http://<driver>:4040)? If you have hundreds of tasks for a file this small, the task
launching time alone might be a problem. You can use RDD.coalesce() to get fewer partitions,
and therefore fewer tasks.
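For reference, Spark launches one task per partition, so coalescing cuts the task count directly. Here is a toy sketch in plain Python (not Spark's API; the function and the partition lists are illustrative) of the partition-merging idea behind coalesce:

```python
def coalesce(partitions, num_partitions):
    """Merge a list of partitions into at most num_partitions buckets.

    Toy model only: Spark's RDD.coalesce() does this across the cluster,
    preferring merges that avoid a shuffle. The point is the count: one
    task is launched per partition, so fewer partitions means fewer tasks.
    """
    num_partitions = min(num_partitions, len(partitions))
    buckets = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        buckets[i % num_partitions].extend(part)
    return buckets

# 200 tiny partitions would mean 200 tasks; after coalescing, only 8 tasks
tiny = [[i] for i in range(200)]
merged = coalesce(tiny, 8)
```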

2) If you run a Java profiler (e.g. YourKit or hprof) on the workers while the application
is executing, where is time being spent? Maybe some of your code is more expensive than it
seems. One other thing you might find is that some code you use requires synchronization and
is therefore not scaling properly to multiple cores (e.g. Java’s Math.random() actually
does that).
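To illustrate that synchronization point: Java's Math.random() draws from one shared generator, so all threads serialize on it even though each call looks cheap. A minimal sketch of the same pattern in Python (the names here are illustrative, and this is an analogy, not Java's actual implementation):

```python
import random
import threading

shared_rng = random.Random(0)
shared_lock = threading.Lock()

def draw_shared(n, out):
    # All threads funnel through one lock -- the shape of the
    # Math.random() problem: correctness is fine, scaling is not.
    for _ in range(n):
        with shared_lock:
            out.append(shared_rng.random())

def draw_local(n, out):
    # Per-thread generator (analogous to Java's ThreadLocalRandom):
    # no shared state, so threads never contend.
    rng = random.Random(threading.get_ident())
    for _ in range(n):
        out.append(rng.random())
```

In Java the usual fix is to call ThreadLocalRandom.current().nextDouble() instead of Math.random().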

3) Are there any RDDs that are used over and over but not cached? In that case they’ll be
recomputed on each use.
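As a toy model of why this matters (plain Python, not Spark's API; ToyRDD is a made-up name): each action on an uncached dataset re-runs the whole lineage, while a cached one computes it once.

```python
class ToyRDD:
    """Toy model of a lazy dataset: recomputed on every action
    unless cached (very loosely like RDD.cache() / persist())."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cache_enabled = False
        self._cached = None
        self.computations = 0  # how many times the lineage actually ran

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute_fn()
        if self._cache_enabled:
            self._cached = result
        return result

uncached = ToyRDD(lambda: [x * x for x in range(1000)])
uncached.collect(); uncached.collect()   # lineage runs twice

cached = ToyRDD(lambda: [x * x for x in range(1000)]).cache()
cached.collect(); cached.collect()       # lineage runs once
```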

Once you look into these it might be easier to improve the multiple-job case. In that case,
as others have pointed out, running the jobs in the same SparkContext and using the fair scheduler
( should work.
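For reference, a minimal sketch of the fair-scheduler setup within one SparkContext: set spark.scheduler.mode to FAIR, optionally point spark.scheduler.allocation.file at an XML allocation file, and assign each job's thread to a pool with sc.setLocalProperty("spark.scheduler.pool", "<name>"). A sample allocation file (pool names and weights here are illustrative values, not recommendations):

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: one pool per request type -->
<allocations>
  <pool name="heavy">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```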


On Mar 9, 2014, at 5:56 AM, Livni, Dana <> wrote:

> YARN also has this scheduling option.
> The problem is that all of our applications have the same flow, where the first stage is
the heaviest and the rest are very small.
> When several requests (applications) start to run at the same time, the first stage of
each is scheduled in parallel, and for some reason they delay each other.
> A stage that takes around 13s alone can reach up to 2m when running in parallel with
other identical stages (around 15 of them).
> -----Original Message-----
> From: elyast [] 
> Sent: Friday, March 07, 2014 20:01
> To:
> Subject: Re: major Spark performance problem
> Hi,
> There is also an option to run Spark applications on top of Mesos in fine-grained mode;
then fair scheduling is possible (applications run in parallel and Mesos is responsible
for scheduling all tasks), so in a sense all applications progress in parallel. Obviously
the total may not be faster, but the benefit is the fair scheduling (small jobs will not
be blocked by the big ones).
> Best regards
> Lukasz Jastrzebski
> --
