Thanks, Jay, for your response. Stragglers are a big problem here. I am seeing such tasks in many stages of the workflow on a consistent basis. It's not due to any particular nodes being slow, since the slow tasks are observed on all the nodes at different points in time. The distribution of task completion times is too skewed for my liking. GC delays are a possible reason, but I am just speculating.

~Mayuresh
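For reference, here is a minimal sketch of enabling Spark's speculative execution, a common mitigation for stragglers that re-launches copies of unusually slow tasks on other nodes. It assumes the system-property style of configuration from the 0.8 docs; the master URL and application name are placeholders:

    import org.apache.spark.SparkContext

    object SpeculationSketch {
      def main(args: Array[String]) {
        // Re-launch copies of tasks that run unusually long (off by default).
        System.setProperty("spark.speculation", "true")
        // Start checking for stragglers once 75% of a stage's tasks have finished.
        System.setProperty("spark.speculation.quantile", "0.75")
        // A task counts as a straggler if it is 1.5x slower than the median task.
        System.setProperty("spark.speculation.multiplier", "1.5")

        // Properties must be set before the SparkContext is created.
        val sc = new SparkContext("spark://master:7077", "speculation-sketch")
        // ... run the Bagel job as before ...
        sc.stop()
      }
    }

Note that speculation only helps when the slowness is specific to a task's placement; if every executor is spending most of its time in GC, a re-launched copy will hit the same problem.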
On Thu, Dec 5, 2013 at 5:31 AM, huangjay <firstname.lastname@example.org> wrote:

Hi,

Maybe you need to check those nodes. They're very slow:
SUCCESS PROCESS_LOCAL ip-10-60-150-111.ec2.internal 2013/12/01 02:11:38 17.7 m 16.3 m 23.3 MB
3447 SUCCESS PROCESS_LOCAL ip-10-12-54-63.ec2.internal 2013/12/01 02:11:26 20.1 m 13.9 m 50.9 MB
On Dec 1, 2013, at 10:59 AM, "Mayuresh Kunjir" <email@example.com> wrote:

I tried passing the DISK_ONLY storage level to Bagel's run method (a sketch of the call is included at the bottom of this thread). It's running without any error so far, but is too slow. I am attaching details for a stage corresponding to the second iteration of my algorithm (foreach at Bagel.scala:237); it has been running for more than 35 minutes. I am noticing very high GC time for some tasks. The setup parameters are listed below.

#nodes = 16
SPARK_WORKER_MEMORY = 13G
SPARK_MEM = 13G
RDD storage fraction = 0.5
degree of parallelism = 192 (16 nodes * 4 cores each * 3)
Serializer = Kryo
Vertex data size after serialization = ~12G (probably too high, but it's the bare minimum required for the algorithm)

I would be grateful if you could suggest some further optimizations or point out reasons why/if Bagel is not suitable for this data size. I need to scale my cluster further and am not feeling confident at all looking at this.

Thanks and regards,
~Mayuresh

On Sat, Nov 30, 2013 at 3:07 PM, Mayuresh Kunjir <firstname.lastname@example.org> wrote:

Hi Spark users,

I am running a PageRank-style algorithm on Bagel and bumping into "out of memory" issues with it.

Referring to the table below, rdd_120 is the RDD of vertices, serialized and compressed in memory. On each iteration, Bagel deserializes the compressed RDD; for example, rdd_126 is the uncompressed version of rdd_120, persisted in memory and on disk. As iterations keep piling on, the cached partitions start getting evicted. The moment an rdd_120 partition gets evicted, it necessitates a recomputation and performance goes for a toss. Although we don't need the uncompressed RDDs from previous iterations, they are the last ones to get evicted thanks to the LRU policy.

Should I make Bagel use DISK_ONLY persistence? How much of a performance hit would that be? Or maybe there is a better solution here.
RDD Name   Storage Level                            Cached Partitions   Fraction Cached   Size in Memory   Size on Disk
rdd_83     Memory Serialized 1x Replicated          23                  12%               83.7 MB          0.0 B
rdd_95     Memory Serialized 1x Replicated          23                  12%               2.5 MB           0.0 B
rdd_120    Memory Serialized 1x Replicated          25                  13%               761.1 MB         0.0 B
rdd_126    Disk Memory Deserialized 1x Replicated   192                 100%              77.9 GB          1016.5 MB
rdd_134    Disk Memory Deserialized 1x Replicated   185                 96%               60.8 GB          475.4 MB

Thanks and regards,
~Mayuresh

<BigFrame - Details for Stage 23.htm>
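For concreteness, a minimal sketch of passing an explicit storage level to Bagel's run, assuming the overload that takes numPartitions and a StorageLevel (the parameter the message above refers to). The vertex/message classes, toy input, and compute stub are placeholders, not the real job:

    import org.apache.spark.SparkContext
    import org.apache.spark.bagel.{Bagel, Vertex, Message}
    import org.apache.spark.storage.StorageLevel

    // Placeholder vertex/message types standing in for the real ones in the job.
    class PRVertex(val value: Double, val outEdges: Array[String],
                   val active: Boolean = true) extends Vertex with Serializable
    class PRMessage(val targetId: String, val value: Double)
      extends Message[String] with Serializable

    object StorageLevelSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://master:7077", "bagel-storage-sketch")

        // Tiny toy graph; the real job would load the ~12G vertex RDD instead.
        val vertices = sc.parallelize(Seq(
          ("a", new PRVertex(1.0, Array("b"))),
          ("b", new PRVertex(1.0, Array("a")))))
        val messages = sc.parallelize(Seq.empty[(String, PRMessage)])
        val numPartitions = 192

        // Explicit storage level for Bagel's per-iteration RDDs. DISK_ONLY is what
        // the message above describes trying; MEMORY_AND_DISK_SER would keep the
        // per-iteration RDDs serialized (far smaller, per the table) and spill to
        // disk only under memory pressure.
        val result = Bagel.run(sc, vertices, messages, numPartitions,
            StorageLevel.DISK_ONLY) {
          (self: PRVertex, msgs: Option[Array[PRMessage]], superstep: Int) =>
            // The real compute function goes here; this stub just deactivates the vertex.
            (new PRVertex(self.value, self.outEdges, active = false), Array[PRMessage]())
        }
        result.count()
        sc.stop()
      }
    }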