spark-user mailing list archives

From Mayuresh Kunjir <mayuresh.kun...@gmail.com>
Subject Re: Bagel caching issues
Date Thu, 05 Dec 2013 18:17:45 GMT
Thanks Josh for the excellent link. Most likely, my graph follows a power-law
distribution.
There is, however, no clear correlation between shuffle read and task
duration if you look at the attached page.
Of course, shuffle read size alone does not tell us about the work carried
out by a task, does it?
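
A quick way to check the power-law suspicion is to histogram the graph's in-degrees. The sketch below is only illustrative: it assumes the link graph is available as an RDD of (sourceId, destinationId) pairs, and edges / degreeHistogram are placeholder names rather than anything in the actual job.

import org.apache.spark.SparkContext._   // pair-RDD functions (needed on pre-1.3 Spark)
import org.apache.spark.rdd.RDD

// edges: the link graph as (sourceId, destinationId) pairs -- a placeholder
// for however the graph is actually loaded.
def degreeHistogram(edges: RDD[(Long, Long)]): Array[(Int, Long)] = {
  // In-degree of each vertex: number of incoming edges.
  val inDegrees = edges.map { case (_, dst) => (dst, 1L) }.reduceByKey(_ + _)

  // Bucket vertices by order of magnitude of in-degree. A long tail of high
  // buckets (a handful of vertices with degree 10^5, 10^6, ...) is the
  // power-law shape that makes a few tasks far more expensive than the rest.
  inDegrees
    .map { case (_, deg) => (math.log10(deg.toDouble).toInt, 1L) }
    .reduceByKey(_ + _)
    .collect()
    .sortBy(_._1)
}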

Do you know if any performance comparisons have been carried out between the
Pregel and PowerGraph implementations as part of the GraphX work? If not, I
would be happy to explore this.

Is GraphX in a stable state to be tried out?

Regards,
~Mayuresh


On Thu, Dec 5, 2013 at 9:31 AM, Josh Rosen <rosenville@gmail.com> wrote:

> The variability in task completion times could be caused by variability in
> the amount of work that those tasks perform rather than slow or faulty
> nodes.
>
> For PageRank, consider a link graph that contains a few disproportionately
> popular webpages with many inlinks (such as Yahoo.com).  These
> high-degree nodes may cause significant communication imbalances because
> they receive and send many messages in a Pregel-like model.  If you look at
> the distribution of shuffled data sizes, does it exhibit similar skew to
> the task completion times?
>
> The PowerGraph paper gives a good overview of the challenges posed by
> these types of large-scale natural graphs and develops techniques to split
> up and parallelize the processing of these high-degree nodes:
> http://graphlab.org/powergraph-presented-at-osdi/
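
One Bagel-level mitigation for that message imbalance is a combiner that merges messages addressed to the same vertex before they are shuffled. The sketch below follows the Combiner and Message traits as described in the Bagel programming guide; PRMessage and PRSumCombiner are hypothetical stand-ins for the real message class used in this job.

import org.apache.spark.bagel.{Combiner, Message}

// Hypothetical PageRank-style message carrying a rank contribution.
class PRMessage(val targetId: String, val value: Double)
  extends Message[String] with Serializable

// Collapses all messages bound for the same vertex into a partial sum, so a
// high-degree vertex receives combined values instead of one message per
// inlink.
class PRSumCombiner extends Combiner[PRMessage, Double] with Serializable {
  def createCombiner(msg: PRMessage): Double = msg.value
  def mergeMsg(combiner: Double, msg: PRMessage): Double = combiner + msg.value
  def mergeCombiners(a: Double, b: Double): Double = a + b
}

A combiner like this can be passed to the Bagel.run overloads that accept one; the shuffle then carries roughly one combined value per sending partition for each vertex rather than every individual contribution.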
>
> On Thu, Dec 5, 2013 at 6:54 AM, Mayuresh Kunjir <mayuresh.kunjir@gmail.com> wrote:
>
>
>> Thanks Jay for your response. Stragglers are a big problem here. I am
>> seeing such tasks in many stages of the workflow on a consistent basis.
>> It's not due to any particular nodes being slow since the slow tasks are
>> observed on all the nodes at different points in time.
>> The distribution of task completion times is too skewed for my liking.
>> GC delays are a possible reason, but I am just speculating.
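
To help separate data skew from GC effects, it is worth checking how many records each partition of the slow stage's input actually holds. A minimal sketch, assuming the input is reachable as an ordinary RDD (partitionSizes and data are placeholder names):

import org.apache.spark.rdd.RDD

// Record count per partition, largest first. Heavily unbalanced counts point
// at data skew; balanced counts with skewed task times point more towards GC
// or other per-node effects.
def partitionSizes[T](data: RDD[T]): Array[(Int, Int)] =
  data
    .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
    .collect()
    .sortBy(-_._2)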
>>
>> ~Mayuresh
>>
>>
>>
>>
>> On Thu, Dec 5, 2013 at 5:31 AM, huangjay <jayin@live.cn> wrote:
>>
>>> Hi,
>>>
>>> Maybe you need to check those nodes. These tasks are very slow:
>>>
>>>
>>> Task  Status   Locality       Executor                       Launch Time          Duration  GC Time  Shuffle Read
>>> 3487  SUCCESS  PROCESS_LOCAL  ip-10-60-150-111.ec2.internal  2013/12/01 02:11:38  17.7 m    16.3 m   23.3 MB
>>> 3447  SUCCESS  PROCESS_LOCAL  ip-10-12-54-63.ec2.internal    2013/12/01 02:11:26  20.1 m    13.9 m   50.9 MB
>>>
>>> On Dec 1, 2013, at 10:59 AM, "Mayuresh Kunjir" <mayuresh.kunjir@gmail.com> wrote:
>>>
>>> I tried passing the DISK_ONLY storage level to Bagel's run method. It's
>>> running without any error (so far) but is too slow. I am attaching details
>>> for a stage corresponding to the second iteration of my algorithm (foreach
>>> at Bagel.scala:237 <http://ec2-54-234-176-171.compute-1.amazonaws.com:4040/stages/stage?id=23>).
>>> It's been running for more than 35 minutes. I am noticing very high GC time
>>> for some tasks. The setup parameters are listed below.
>>>
>>> #nodes = 16
>>> SPARK_WORKER_MEMORY = 13G
>>> SPARK_MEM = 13G
>>> RDD storage fraction = 0.5
>>> degree of parallelism = 192 (16 nodes * 4 cores each * 3)
>>> Serializer = Kryo
>>> Vertex data size after serialization = ~12G (probably too high, but it's
>>> the bare minimum required for the algorithm.)
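
For reference, these settings roughly correspond to the following driver-side properties in 0.8-era Spark (property names per the Spark configuration guide of that time; the master URL and the Kryo registrator class below are placeholders):

import org.apache.spark.SparkContext

// Set before the SparkContext is created.
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "myapp.MyKryoRegistrator")  // hypothetical registrator
System.setProperty("spark.storage.memoryFraction", "0.5")   // RDD storage fraction = 0.5
System.setProperty("spark.default.parallelism", "192")      // 16 nodes * 4 cores each * 3

// Per-task GC time shows up in the web UI; detailed GC logs can be enabled by
// adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to
// SPARK_JAVA_OPTS in spark-env.sh.
val sc = new SparkContext("spark://<master>:7077", "PageRankOnBagel")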
>>>
>>> I would be grateful if you could suggest some further optimizations or
>>> point out reasons why Bagel might not be suitable for this data size. I need
>>> to scale my cluster further and am not feeling confident at all looking at
>>> this.
>>>
>>> Thanks and regards,
>>> ~Mayuresh
>>>
>>>
>>> On Sat, Nov 30, 2013 at 3:07 PM, Mayuresh Kunjir <mayuresh.kunjir@gmail.com> wrote:
>>>
>>>> Hi Spark users,
>>>>
>>>> I am running a PageRank-style algorithm on Bagel and bumping into "out
>>>> of memory" issues with it.
>>>>
>>>> Referring to the following table, rdd_120 is the RDD of vertices,
>>>> serialized and compressed in memory. On each iteration, Bagel deserializes
>>>> the compressed RDD; e.g., rdd_126 is the uncompressed version of rdd_120,
>>>> persisted in memory and on disk. As iterations keep piling on, the cached
>>>> partitions start getting evicted. The moment a rdd_120 partition gets
>>>> evicted, it necessitates a recomputation and the performance goes for a
>>>> toss. Although we don't need the uncompressed RDDs from previous iterations,
>>>> they are the last ones to get evicted thanks to the LRU policy.
>>>>
>>>> Should I make Bagel use DISK_ONLY persistence? How much of a
>>>> performance hit would that be? Or maybe there is a better solution here.
>>>>
>>>> Storage
>>>>
>>>> RDD Name  Storage Level                           Cached Partitions  Fraction Cached  Size in Memory  Size on Disk
>>>> rdd_83    Memory Serialized 1x Replicated         23                 12%              83.7 MB         0.0 B
>>>> rdd_95    Memory Serialized 1x Replicated         23                 12%              2.5 MB          0.0 B
>>>> rdd_120   Memory Serialized 1x Replicated         25                 13%              761.1 MB        0.0 B
>>>> rdd_126   Disk Memory Deserialized 1x Replicated  192                100%             77.9 GB         1016.5 MB
>>>> rdd_134   Disk Memory Deserialized 1x Replicated  185                96%              60.8 GB         475.4 MB
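
The trade-off behind the DISK_ONLY question can be sketched with the plain RDD persist API (the intermediate RDDs that Bagel creates internally are governed by the storage level handed to Bagel.run, not by code like this; cacheVertices is a hypothetical helper):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER:     compact serialized form, but evicted partitions must be recomputed.
// MEMORY_AND_DISK_SER: partitions that do not fit in memory spill to local disk instead
//                      of being dropped, so no recomputation is needed.
// DISK_ONLY:           never competes for cache memory; every superstep reads from disk.
def cacheVertices[T](vertices: RDD[T]): RDD[T] =
  vertices.persist(StorageLevel.MEMORY_AND_DISK_SER)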
>>>> Thanks and regards,
>>>> ~Mayuresh
>>>>
>>>
>>> <BigFrame - Details for Stage 23.htm>
>>>
>>>
>>
>
