spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Cluster taking a long time with not much activity (or so I think)
Date Wed, 26 Mar 2014 16:54:22 GMT
Another issue could be not enough memory. Can you try it out with 1TB, or
possibly 500GB, of data and scale up gradually?
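
For example, a minimal sketch of such a scale-down test (assuming a spark-shell
session where sc is already defined; the input path, sample fraction, and seed
below are placeholders, not from the original job):

      // Run the same pipeline on roughly a quarter of the 2TB input (~500GB)
      // and check whether runtime and memory use grow as expected.
      val full = sc.textFile("hdfs://x.y.z/path/to/input")  // placeholder path
      val sampled = full.sample(false, 0.25, 42)            // no replacement, ~25% sample, fixed seed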
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 26, 2014 at 12:52 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:

> You can check the storage tab of your application. If you see RDDs
> spilling to disk, that could be an issue.
> Another possibility is that disk commits are taking time, so disk
> utilization could be relevant.
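
As an aside, only RDDs that have been explicitly persisted appear in the
storage tab, so a minimal sketch (using one of the RDDs from the job quoted
below, purely for illustration):

      import org.apache.spark.storage.StorageLevel

      // Persist with an explicit storage level so the web UI's Storage tab
      // shows how much of the RDD sits in memory versus spilled to disk.
      countsAllCategories.persist(StorageLevel.MEMORY_AND_DISK)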
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Mon, Mar 24, 2014 at 8:13 PM, Vipul Pandey <vipandey@gmail.com> wrote:
>
>> Hi
>>
>> My use case is pretty simple:
>> - get all the data (2TB uncompressed)
>> - calculate some aggregates for, say, time slices (A) - this could be
>> every minute of every day for the past 1 month
>> - calculate some aggregates for a filtered subset of the data for the same
>> slices (B)
>> - join them and calculate the % of B w.r.t. A
>> - save the result to a file (160MB)
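
Purely for illustration, the aggregation steps might look roughly like the
sketch below (records, minuteOf, and matchesFilter are made-up names, not taken
from the actual job; the join and percentage step appear in the real code
further down):

      // A: count of records per time slice (e.g. per minute) over all data
      val countsAllCategories =
        records.map(r => (minuteOf(r), 1L)).reduceByKey(_ + _)
      // B: the same counts over a filtered subset of the data
      val countsForCategory =
        records.filter(matchesFilter).map(r => (minuteOf(r), 1L)).reduceByKey(_ + _)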
>>
>> # Nodes = 20 (150GB each)
>> Spark version = 0.9.0
>> Input data size = 2TB
>> Output data size = 160MB
>>
>>
>> Everything else works fine, but the saveAsTextFile call takes about an
>> hour. In that hour the CPU utilization, load average, and network traffic
>> are pretty low - in fact they taper off after 30 minutes. (Check out the
>> values in the plots below starting at 15:08.)
>>
>> Here's my code:
>>
>>       val joined = countsForCategory.join(countsAllCategories)
>>       val counts = joined.map(x => (x._1, 100 * calculateRatio(x._2)))
>>       counts.map(x => x._1 + "," + x._2).coalesce(10)
>>         .saveAsTextFile("hdfs://x.y.z/path/to/output/dir")
>>
>> Can someone explain what's going on? Is that expected? As mentioned, my
>> output data is pretty small.
>>
>> [plots: CPU Utilization, Network, Load Average]
>
>
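
One point about the coalesce(10) call in the code above that may be worth
illustrating: without the shuffle flag, coalesce merges partitions inside the
same stage, so the maps feeding saveAsTextFile run in only 10 tasks. A hedged
sketch of the alternative (same variables as above; whether it actually helps
this job is untested):

      // coalesce(10, shuffle = true) inserts a shuffle, so the upstream maps keep
      // their original parallelism and only the final write uses 10 tasks.
      counts.map(x => x._1 + "," + x._2)
            .coalesce(10, shuffle = true)
            .saveAsTextFile("hdfs://x.y.z/path/to/output/dir")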
