spark-user mailing list archives

From Andrea Esposito <and1...@gmail.com>
Subject Re: Incredible slow iterative computation
Date Fri, 02 May 2014 10:29:33 GMT
Sorry for the very late answer.

I carefully followed what you pointed out and figured out that the
structure used for each record was too big, with many small objects.
After changing it, the memory usage decreased drastically.

Despite that, I'm still struggling with performance that degrades across
supersteps. The memory footprint is now much smaller than before, and GC
time is no longer noticeable.
I suspected that some RDDs are being recomputed, and looking carefully at
the stages there is evidence of that, but I don't understand why it is
happening.

Recalling my usage pattern:

> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
> newRdd.foreach(x => {}) // Force evaluation
> oldRdd.unpersist(true)

Following this usage pattern, I also tried not unpersisting the intermediate
RDDs (i.e. oldRdd), but nothing changed.
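For clarity, here is a self-contained sketch of the loop (the data and the map
function are placeholders, not my real code). Printing the lineage at the end
shows how the dependency chain grows by one step per superstep, which would
explain the recomputation I see in the stages view:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder job, run locally just to illustrate the pattern.
val conf = new SparkConf().setAppName("superstep-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

var oldRdd = sc.parallelize(1 to 4000)
  .map(v => (v, 0))
  .persist(StorageLevel.MEMORY_ONLY_SER)
oldRdd.foreach(x => {}) // Force evaluation

for (superstep <- 1 to 10) {
  // Stand-in for myFun: compute the next superstep's state per record.
  val newRdd = oldRdd.map { case (v, s) => (v, s + 1) }
    .persist(StorageLevel.MEMORY_ONLY_SER)
  newRdd.foreach(x => {}) // Force evaluation
  oldRdd.unpersist(true)
  oldRdd = newRdd
}

// The lineage grows by one map per superstep: if a cached partition is ever
// missing, Spark replays the whole chain from the beginning.
println(oldRdd.toDebugString)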

Any hints? How could I debug this?
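One thing I haven't tried yet (just a sketch extending the loop above, and only
relevant if the problem really is the growing lineage) is to checkpoint the
current RDD every few supersteps so the chain gets truncated instead of growing
forever; the directory and the interval below are arbitrary placeholders:

sc.setCheckpointDir("/tmp/superstep-checkpoints") // once, before the loop

// Inside the loop, right after newRdd is created and before it is evaluated:
if (superstep % 10 == 0) {
  newRdd.checkpoint() // mark for checkpointing before the first action on it
}
newRdd.foreach(x => {}) // this action also writes the checkpoint, cutting the lineage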



2014-04-14 12:55 GMT+02:00 Andrew Ash <andrew@andrewash.com>:

> A lot of your time is being spent in garbage collection (second image).
>  Maybe your dataset doesn't easily fit into memory?  Can you reduce the
> number of new objects created in myFun?
>
> How big are your heap sizes?
>
> Another observation is that in the 4th image some of your RDDs are massive
> and some are tiny.
>
>
> On Mon, Apr 14, 2014 at 11:45 AM, Andrea Esposito <and1989@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm developing an iterative computation over graphs, but I'm struggling
>> with embarrassingly low performance.
>>
>> The computation is heavily iterative and I'm following this RDD usage
>> pattern:
>>
>>> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
>>> newRdd.foreach(x => {}) // Force evaluation
>>> oldRdd.unpersist(true)
>>
>> I'm using a machine equipped with 30 cores and 120 GB of RAM.
>> As an example, I've run it on a small graph of 4,000 vertices and 80,000
>> edges: the first iterations already take 10+ minutes, and later ones take
>> much longer.
>> I attach the Spark UI screenshots of just the first 2 iterations.
>>
>> I tried MEMORY_ONLY_SER and MEMORY_AND_DISK_SER, and I also changed
>> "spark.shuffle.memoryFraction" to 0.3, but nothing changed (with this much
>> RAM for such a small graph these settings are quite pointless, I guess).
>>
>> How should I continue to investigate?
>>
>> Any advice is very welcome, thanks.
>>
>> Best,
>> EA
>>
>
>
