spark-user mailing list archives

From Andrew Ash <>
Subject Re: performance
Date Wed, 08 Jan 2014 22:57:56 GMT
My first thought on hearing that you're calling collect is that pulling all
the data back to the driver is heavy on the network.  Try checking the
basic system metrics on the machines to get a sense of what's being heavily
used:

disk IO

Any kind of distributed system monitoring framework should be able to
handle these sorts of things.
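
To narrow down which step is slow, one option is to time each stage on its
own. The snippet below is only a rough sketch in plain Python, not Spark code:
the sample data and the helper names (timed, left_outer_join, expand) are
hypothetical stand-ins for the real join and flatMap function. On a real
cluster the transformations are lazy, so each stage would need an action such
as count() to force it before the timing means anything.

```python
import time

def timed(label, fn, timings):
    """Run fn(), store elapsed wall-clock seconds under label, return the result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result

# Hypothetical stand-in data; in Spark these would be keyed RDDs.
left = [(1, "a"), (2, "b"), (3, "c")]
right = {1: ["x", "y"], 3: ["z"]}

def left_outer_join(pairs, lookup):
    # Keep every left key; None marks a key with no match on the right.
    return [(k, (v, w)) for k, v in pairs for w in lookup.get(k, [None])]

def expand(joined):
    # Stand-in for the function passed to flatMap; drops unmatched rows.
    return [f"{k}:{v}:{w}" for k, (v, w) in joined if w is not None]

timings = {}
joined = timed("join", lambda: left_outer_join(left, right), timings)
flat = timed("flatMap", lambda: expand(joined), timings)

# The stage with the largest elapsed time is the first place to look.
slowest = max(timings, key=timings.get)
print(timings, "slowest:", slowest)
```

If the join dominates, look at shuffle and network; if the flatMap step
dominates, the function itself is the likelier suspect.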


On Wed, Jan 8, 2014 at 1:49 PM, Yann Luppo <> wrote:

>  Hi,
>  I have what I hope is a simple question. What's a typical approach to
> diagnosing performance issues on a Spark cluster?
> We've followed all the pertinent parts of the following document already:
> But we still seem to have issues. More specifically, we have a
> leftOuterJoin followed by a flatMap and then a collect that runs a bit long.
>  How would I go about determining the bottleneck operation(s)?
> Is our leftOuterJoin taking a long time?
> Is the function we send to the flatMap not optimized?
>  Thanks,
> Yann
