spark-user mailing list archives

From Josh Rosen <rosenvi...@gmail.com>
Subject Re: Bagel caching issues
Date Thu, 05 Dec 2013 17:31:21 GMT
The variability in task completion times could be caused by variability in
the amount of work that those tasks perform rather than slow or faulty
nodes.

For PageRank, consider a link graph that contains a few disproportionately
popular webpages that have many inlinks (such as Yahoo.com).  These
high-degree nodes may cause significant communications imbalances because
they receive and send many messages in a Pregel-like model.  If you look at
the distribution of shuffled data sizes, does it exhibit similar skew to
the task completion times?
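
To make the imbalance concrete, here is a minimal sketch in plain Python (not Bagel or Spark code). It assumes every vertex sends one message along each out-edge per superstep and that vertices are assigned to partitions by vertex id modulo the partition count. The toy graph, the function names, and the partition count are all made up for illustration.

```python
def messages_per_partition(edges, num_partitions):
    """Count the messages each partition receives in one superstep.

    Assumes one message per edge, delivered to the partition that owns
    the destination vertex (vertex id modulo num_partitions).
    """
    received = [0] * num_partitions
    for _src, dst in edges:
        received[dst % num_partitions] += 1
    return received


def imbalance(loads):
    """Ratio of the heaviest partition's load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Toy graph (hypothetical): vertices 1..99 all link to hub vertex 0,
# and also form a ring among themselves.
hub_edges = [(v, 0) for v in range(1, 100)]
ring_edges = [(v, v % 99 + 1) for v in range(1, 100)]

loads = messages_per_partition(hub_edges + ring_edges, 4)
print(loads)             # [123, 25, 25, 25]
print(imbalance(loads))  # well above 1.0: the hub's partition is the straggler
```

The partition that owns the hub receives roughly five times as many messages as the others, so its task would finish last even though every node is equally healthy.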

The PowerGraph paper gives a good overview of the challenges posed by these
large-scale natural graphs and develops techniques to split up and
parallelize the processing of their high-degree nodes:
http://graphlab.org/powergraph-presented-at-osdi/
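
As a rough illustration of the "split up the high-degree node" idea, the sketch below partitions edges instead of vertices, lets each partition compute a partial sum per destination, and then combines the partials. This is a simplification in plain Python: the round-robin edge placement, the function names, and the toy data are assumptions for illustration, not PowerGraph's actual placement strategy or API.

```python
from collections import defaultdict


def partial_sums_by_edge_partition(edges, contrib, num_partitions):
    """Place edges round-robin across partitions and sum contributions locally.

    Returns one {destination: partial_sum} mapping per partition.
    """
    partials = [defaultdict(float) for _ in range(num_partitions)]
    for i, (src, dst) in enumerate(edges):
        partials[i % num_partitions][dst] += contrib[src]
    return partials


def combine(partials):
    """Merge the per-partition partial sums into one total per destination."""
    totals = defaultdict(float)
    for part in partials:
        for dst, value in part.items():
            totals[dst] += value
    return totals


# Toy data (hypothetical): 100 vertices each send a contribution of 1.0 to hub 0.
edges = [(v, 0) for v in range(1, 101)]
contrib = {v: 1.0 for v in range(1, 101)}

partials = partial_sums_by_edge_partition(edges, contrib, 4)
print([part[0] for part in partials])  # [25.0, 25.0, 25.0, 25.0]
print(combine(partials)[0])            # 100.0
```

The hub's 100 incoming contributions are processed 25 per partition, and only four partial sums need to be combined, rather than one partition receiving all 100 messages.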

On Thu, Dec 5, 2013 at 6:54 AM, Mayuresh Kunjir
<mayuresh.kunjir@gmail.com> wrote:

> Thanks Jay for your response. Stragglers are a big problem here. I am
> seeing such tasks in many stages of the workflow on a consistent basis.
> It's not due to any particular nodes being slow since the slow tasks are
> observed on all the nodes at different points in time.
> The distribution of task completion times is too skewed for my liking.
> GC delays are a possible reason, but I am just speculating.
>
> ~Mayuresh
>
>
>
>
> On Thu, Dec 5, 2013 at 5:31 AM, huangjay <jayin@live.cn> wrote:
>
>> Hi,
>>
>> Maybe you need to check these nodes. They are very slow.
>>
>>
>> 3487  SUCCESS  PROCESS_LOCAL  ip-10-60-150-111.ec2.internal  2013/12/01 02:11:38  17.7 m  16.3 m  23.3 MB
>> 3447  SUCCESS  PROCESS_LOCAL  ip-10-12-54-63.ec2.internal    2013/12/01 02:11:26  20.1 m  13.9 m  50.9 MB
>>
>> On Dec 1, 2013, at 10:59 AM, "Mayuresh Kunjir" <mayuresh.kunjir@gmail.com>
>> wrote:
>>
>> I tried passing DISK_ONLY storage level to Bagel's run method. It's
>> running without any error (so far) but is too slow. I am attaching details
>> for a stage corresponding to second iteration of my algorithm. (foreach
>> at Bagel.scala:237<http://ec2-54-234-176-171.compute-1.amazonaws.com:4040/stages/stage?id=23>)
>> It's been running for more than 35 minutes. I am noticing very high GC time
>> for some tasks. Listing below the setup parameters.
>>
>> #nodes = 16
>> SPARK_WORKER_MEMORY = 13G
>> SPARK_MEM = 13G
>> RDD storage fraction = 0.5
>> degree of parallelism = 192 (16 nodes * 4 cores each * 3)
>> Serializer = Kryo
>> Vertex data size after serialization = ~12G (probably too high, but it's
>> the bare minimum required for the algorithm.)
>>
>> I would be grateful if you could suggest some further optimizations or
>> point out reasons why/if Bagel is not suitable for this data size. I need
>> to further scale my cluster and am not feeling confident at all looking at
>> this.
>>
>> Thanks and regards,
>> ~Mayuresh
>>
>>
>> On Sat, Nov 30, 2013 at 3:07 PM, Mayuresh Kunjir <
>> mayuresh.kunjir@gmail.com> wrote:
>>
>>> Hi Spark users,
>>>
>>> I am running a pagerank-style algorithm on Bagel and bumping into "out
>>> of memory" issues with that.
>>>
>>> Referring to the following table, rdd_120 is the rdd of vertices,
>>> serialized and compressed in memory. On each iteration, Bagel deserializes
>>> the compressed rdd. e.g. rdd_126 shows the uncompressed version of rdd_120
>>> persisted in memory and disk. As iterations keep piling on, the cached
>>> partitions start getting evicted. The moment a rdd_120 partition gets
>>> evicted, it necessitates a recomputation and the performance goes for a
>>> toss. Although we don't need uncompressed rdds from previous iterations,
>>> they are the last ones to get evicted thanks to LRU policy.
>>>
>>> Should I make Bagel use DISK_ONLY persistence? How much of a performance
>>> hit would that be? Or maybe there is a better solution here.
>>>
>>> Storage
>>>
>>> RDD Name  Storage Level                            Cached Partitions  Fraction Cached  Size in Memory  Size on Disk
>>> rdd_83    Memory Serialized 1x Replicated          23                 12%              83.7 MB         0.0 B
>>> rdd_95    Memory Serialized 1x Replicated          23                 12%              2.5 MB          0.0 B
>>> rdd_120   Memory Serialized 1x Replicated          25                 13%              761.1 MB        0.0 B
>>> rdd_126   Disk Memory Deserialized 1x Replicated   192                100%             77.9 GB         1016.5 MB
>>> rdd_134   Disk Memory Deserialized 1x Replicated   185                96%              60.8 GB         475.4 MB
>>>
>>> Thanks and regards,
>>> ~Mayuresh
>>>
>>
>> <BigFrame - Details for Stage 23.htm>
>>
>>
>
