spark-user mailing list archives

From Evan Sparks <>
Subject Re: performance
Date Thu, 09 Jan 2014 01:28:55 GMT
On this note: the Ganglia web front end that runs on the master (assuming you're launching
with the EC2 scripts) is great for this.

Also, a common technique for diagnosing "which step is slow" is to call '.cache' and then '.count'
on the RDD after each step. Since count is an action, this forces the RDD to be materialized at
that point, which sidesteps the lazy evaluation that otherwise makes it hard to tell where the
time is going.
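
A rough sketch of that technique in PySpark, in case it helps. The helper name `materialize_and_time` and the RDD names in the usage comment are made up for illustration and are not part of Spark; the only Spark calls assumed are `cache()` and `count()` on the RDD.

```python
import time


def materialize_and_time(label, rdd):
    """Cache an RDD, force it to be computed with count(), and report the time.

    `rdd` is assumed to be anything exposing cache() and count(),
    such as a PySpark RDD.
    """
    rdd.cache()                      # mark the RDD so later steps reuse it
    start = time.time()
    n = rdd.count()                  # action: triggers the actual computation
    elapsed = time.time() - start
    print("%s: %d records in %.2fs" % (label, n, elapsed))
    return rdd, n, elapsed


# Hypothetical usage (left, right and my_func are placeholders):
#   joined, _, _ = materialize_and_time("leftOuterJoin", left.leftOuterJoin(right))
#   flat, _, _ = materialize_and_time("flatMap", joined.flatMap(my_func))
```

Because each step's input was cached by the previous call, each timing should mostly reflect that one step, provided the cached data fits in memory. Take the extra calls out again once you've found the slow step.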

- Evan

> On Jan 8, 2014, at 2:57 PM, Andrew Ash <> wrote:
> My first thought on hearing that you're calling collect is that taking all the data back
> to the driver is intensive on the network.  Try checking the basic systems stuff on the machines
> to get a sense of what's being heavily used:
> disk IO
> network
> Any kind of distributed system monitoring framework should be able to handle these sorts
> of things.
> Cheers!
> Andrew
>> On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <> wrote:
>> Hi,
>> I have what I hope is a simple question. What's a typical approach to diagnosing
>> performance issues on a Spark cluster?
>> We've followed all the pertinent parts of the following document already:
>> But we seem to still have issues. More specifically we have a leftOuterJoin followed
>> by a flatMap and then a collect running a bit long.
>> How would I go about determining the bottleneck operation(s)?
>> Is our leftOuterJoin taking a long time?
>> Is the function we send to the flatMap not optimized?
>> Thanks,
>> Yann
