spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Cluster taking a long time with not much activity (or so I think)
Date Wed, 26 Mar 2014 16:52:56 GMT
Check the storage tab of your application's UI. If you see RDDs
spilling to disk, that could be the issue.
Another possibility is that disk commits are slow, so disk utilization
is also worth checking.
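
The same information the storage tab shows can also be read from the driver. The following is a minimal sketch, assuming a live `SparkContext` and that the RDDs of interest were persisted (only persisted RDDs are reported); `StorageCheck`, `describe` and `reportSpilledRdds` are hypothetical names used for illustration.

```scala
import org.apache.spark.SparkContext

object StorageCheck {
  // Pure helper (hypothetical name): formats one line of the report.
  def describe(name: String, memBytes: Long, diskBytes: Long): String =
    s"$name: memory=${memBytes}B disk=${diskBytes}B"

  // Prints every persisted RDD that currently has blocks on disk.
  // Assumes `sc` is the SparkContext of the running application.
  def reportSpilledRdds(sc: SparkContext): Unit =
    sc.getRDDStorageInfo
      .filter(_.diskSize > 0)
      .foreach(info => println(describe(info.name, info.memSize, info.diskSize)))
}
```

If this prints nothing while the job is slow, spilling of persisted RDDs is probably not the cause, and disk utilization on the nodes is the next thing to look at.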
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Mar 24, 2014 at 8:13 PM, Vipul Pandey <vipandey@gmail.com> wrote:

> Hi
>
> My use case is pretty simple:
> - Get all the data (2TB uncompressed)
> - Calculate some aggregates for, say, time slices (A); this could be
> every minute of every day for the past month
> - Calculate some aggregates for a filtered subset of the data for the same
> slices (B)
> - Join them and calculate the % of B with respect to A
> - Save the result to a file (160MB)
>
> Nodes = 20 (150G each)
> Spark Version = 0.9.0
> Input Data Size = 2TB
> Output Data Size = 160MB
>
>
> Everything else works fine, but the saveAsTextFile call takes about an hour.
> During that hour the CPU utilization, load average and network traffic are
> pretty low; in fact, they taper off after 30 minutes (see the values in
> the plots below, starting at 15:08).
>
> Here's my code:
>
>       val joined = countsForCategory.join(countsAllCategories)
>       val counts = joined.map(x => (x._1, 100 * calculateRatio(x._2)))
>       counts.map(x => x._1 + "," + x._2).coalesce(10).saveAsTextFile("hdfs://x.y.z/path/to/output/dir")
>
> Can someone explain what's going on? Is this expected? As mentioned, my
> output data is pretty small.
>
> [Plots not preserved in the archive: CPU Utilization, Network, Load Average]
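
The steps listed in the quoted message (aggregate per time slice, aggregate a filtered subset, join, take the percentage) can be sketched on plain Scala collections, without Spark. This is only an illustration under stated assumptions: the `Event` record, the category filter and the helper names are hypothetical, and the ratio is assumed to be a simple B/A since the original `calculateRatio` is not shown.

```scala
object SliceRatios {
  // Hypothetical record type: the source does not show the input schema.
  final case class Event(slice: String, category: String)

  // Counts events per time slice.
  def countsBySlice(events: Seq[Event]): Map[String, Long] =
    events.groupBy(_.slice).map { case (slice, es) => slice -> es.size.toLong }

  // Inner join on slice, then B as a percentage of A.
  // Slices missing from `a` (or with a zero count) are dropped.
  def percentBySlice(b: Map[String, Long], a: Map[String, Long]): Map[String, Double] =
    b.collect {
      case (slice, bCount) if a.getOrElse(slice, 0L) != 0L =>
        slice -> 100.0 * bCount / a(slice)
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("2014-03-01T00:00", "x"),
      Event("2014-03-01T00:00", "y"),
      Event("2014-03-01T00:01", "x"))
    val all = countsBySlice(events)                                 // A
    val filtered = countsBySlice(events.filter(_.category == "x"))  // B
    // Same "key,value" line format as the quoted code.
    percentBySlice(filtered, all).toSeq.sortBy(_._1)
      .foreach { case (slice, pct) => println(slice + "," + pct) }
  }
}
```

In the Spark version, the two count maps correspond to `countsAllCategories` and `countsForCategory`, and the `collect` step plays the role of the `join` followed by `map`.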
