spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: Cluster taking a long time with not much activity (or so I think)
Date Thu, 27 Mar 2014 05:58:39 GMT
Intermediate data could be huge before it's reduced to 160MB.
You can look at the shuffle writes of your tasks. Is that the writes graph?
So the intermediate data is 3TB?
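To make the shuffle-writes point concrete: if the per-slice aggregation uses groupByKey, every raw record crosses the network; reduceByKey (or combineByKey) combines values map-side first, which can shrink the intermediate data dramatically. A plain-Scala sketch of the idea (no cluster needed; the slice keys and counts below are made up for illustration):

```scala
// Map-side combining, illustrated on a plain collection.
// In Spark, reduceByKey applies a combiner like this inside each map
// task before shuffling; groupByKey ships every raw record instead.
val records = Seq(
  ("2014-03-01T00:01", 1L),  // hypothetical per-minute slice keys
  ("2014-03-01T00:01", 1L),
  ("2014-03-01T00:02", 1L)
)

// After a local combine, only one pair per key leaves the "map task".
val combined = records
  .groupBy(_._1)
  .map { case (slice, vs) => (slice, vs.map(_._2).sum) }

println(combined.size) // 2 pairs shuffled instead of 3 raw records
```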

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Mar 27, 2014 at 1:45 AM, Vipul Pandey <vipandey@gmail.com> wrote:

>> You can check out the storage tab of your application. If you see RDD
>> spilling off to disk that could be an issue.
>>
> Storage was just fine. The entire dataset fits into less than a TB of
> memory and I have more.
>
> Another possibility is disk commits are taking time so disk utilization
>> could be relevant.
>>
> Where do you think this will matter? While writing out the final output
> to disk? But that's only 160MB.
>
>
> Here's the plot for the Memory as well.
>
>
>
> On Mar 26, 2014, at 9:54 AM, Mayur Rustagi <mayur.rustagi@gmail.com>
> wrote:
>
> Another issue could be not enough memory. Can you try with 500GB or
> possibly 1TB of data and scale up gradually?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Wed, Mar 26, 2014 at 12:52 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> You can check out the storage tab of your application. If you see RDD
>> spilling off to disk that could be an issue.
>> Another possibility is disk commits are taking time so disk utilization
>> could be relevant.
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Mon, Mar 24, 2014 at 8:13 PM, Vipul Pandey <vipandey@gmail.com> wrote:
>>
>>> Hi
>>>
>>> My use case is pretty simple:
>>> - get all the data (2TB uncompressed)
>>> - calculate some aggregates for, say, time slices (A); this could be
>>> every minute of every day for the past month
>>> - calculate some aggregates for a filtered subset of the data for the
>>> same slices (B)
>>> - join them and calculate the % of B wrt A
>>> - save the result to a file (160MB)
>>>
>>> # Nodes = 20 (150GB each)
>>> Spark version = 0.9.0
>>> Input data size = 2TB
>>> Output data size = 160MB
>>>
>>>
>>> Everything else works fine, but the saveAsTextFile call takes about an
>>> hour. In that hour the CPU utilization, load average and network traffic
>>> are pretty low; in fact they taper off after 30 minutes. (Check out the
>>> values in the plots below starting at 15:08.)
>>>
>>> Here's my code:
>>>
>>>       val joined = countsForCategory.join(countsAllCategories)
>>>       val counts = joined.map(x => (x._1, 100 * calculateRatio(x._2)))
>>>       counts.map(x => x._1 + "," + x._2)
>>>         .coalesce(10)
>>>         .saveAsTextFile("hdfs://x.y.z/path/to/output/dir")
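One possible culprit worth ruling out here (a guess from the code above, not confirmed in the thread): coalesce(10) without a shuffle creates a narrow dependency, so the whole upstream stage, join included, may run with only 10 tasks; coalesce(10, shuffle = true) would keep the join at full parallelism and only shuffle down for the write. The partition arithmetic, sketched in plain Scala with a made-up parent partition count:

```scala
// coalesce without a shuffle folds many parent partitions into few
// tasks; each task then does all of its parents' upstream work serially.
val parentPartitions = (1 to 200).toList            // hypothetical join output
val coalesced = parentPartitions.grouped(20).toList // coalesce down to 10

println(coalesced.size)      // 10 tasks now carry the entire stage
println(coalesced.head.size) // each one processes 20 parent partitions
```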
>>>
>>> Can someone explain what's going on? Is that expected? As mentioned, my
>>> output data is pretty small.
>>>
>>>
>>>
>>>
>>>
>>> <PastedGraphic-81.png>
>>>
>>>
>>> CPU Utilization
>>> <PastedGraphic-79.png>
>>>
>>>
>>>
>>> NETWORK
>>> <PastedGraphic-78.png>
>>>
>>>
>>> LOAD AVERAGE
>>> <PastedGraphic-80.png>
>>>
>>
>>
>
>
