Hi Mitch,

I think that is normal. Network utilization is high while a shuffle is happening. After that, it should come down while each slave node does the computation on the partitions assigned to it. At least that is my understanding.
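
To illustrate what I mean, here is a rough sketch in plain Python. It is not the Spark API and not my actual program; the function names (local_pair_counts, shuffle) and the sample sentences are made up for illustration, and I am assuming a pair means two distinct words in the same sentence. The counting steps only touch data already on a node, while the shuffle step is the one that would move records across the network.

```python
from collections import Counter
from itertools import combinations


def local_pair_counts(sentences):
    """Count unordered word pairs per sentence within one partition (local work only)."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        counts.update(combinations(words, 2))
    return counts


def shuffle(partial_counts, num_reducers):
    """Route each (pair, count) record to a reducer by key hash.

    In a real cluster this is the step that sends data between nodes.
    Returns the per-reducer buckets and the number of records routed.
    """
    buckets = [Counter() for _ in range(num_reducers)]
    moved = 0
    for counts in partial_counts:
        for pair, n in counts.items():
            buckets[hash(pair) % num_reducers][pair] += n
            moved += 1
    return buckets, moved


# Two hypothetical partitions, as if they lived on two different nodes.
partitions = [
    ["the cat sat", "the dog sat"],
    ["the cat ran"],
]

partials = [local_pair_counts(p) for p in partitions]       # local compute
buckets, records_moved = shuffle(partials, num_reducers=2)  # network-heavy step
total = sum(buckets, Counter())                             # combine reducer outputs

print(records_moved, total[("cat", "the")])
```

If the per-partition counts are small compared to the raw sentences, the shuffle is short and most of the wall-clock time is spent in the local steps, which would match a brief burst of network traffic followed by almost none.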

Best,
Julaiti


On Tue, Mar 3, 2015 at 2:32 AM, Mitch Gusat <mgusat@gmail.com> wrote:
Hi Julaiti,

Have you made progress in discovering the bottleneck below?

While I suspect a configuration setting or a program bug, I'm intrigued by "network utilization is high for several seconds at the beginning, then drop close to 0"... Do you know more?

thanks,
Mitch Gusat (IBM research)

On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalafate@eng.ucsd.edu>
wrote:

> Hi there,
>
> I am trying to scale up the data size that my application is handling.
> This application is running on a cluster with 16 slave nodes. Each slave
> node has 60GB of memory. It is running in standalone mode. The data comes
> from HDFS, which is also on the same local network.
>
> To understand how my program is running, I also installed Ganglia on the
> cluster. From a previous run, I know the stage that takes the longest time
> to run is counting word pairs (my RDD consists of sentences from a corpus).
> My goal is to identify the bottleneck of my application, then modify my
> program or hardware configuration accordingly.
>
> Unfortunately, I didn't find much information on Spark monitoring and
> optimization. Reynold Xin gave a great talk at Spark Summit 2014 on
> application tuning from the task perspective. Basically, his focus is on
> tasks that are oddly slower than the average. However, it didn't solve my
> problem because no tasks run much slower than the others in my case.
>
> So I tried to identify the bottleneck from the hardware perspective. I want
> to know what the limitation of the cluster is. I think if the executors are
> running hard, either CPU, memory or network bandwidth (or maybe a
> combination) is hitting the roof. But Ganglia reports that the CPU
> utilization of the cluster is no more than 50%, and network utilization is
> high for several seconds at the beginning, then drops close to 0. From the
> Spark UI, I can see the node with the maximum memory usage is consuming
> around 6GB, while "spark.executor.memory" is set to 20GB.
>
> I am very confused that the program is not running fast enough while
> hardware resources are not in short supply. Could you please give me some
> hints about what determines the performance of a Spark application from the
> hardware perspective?
>
> Thanks!
>
> Julaiti