spark-issues mailing list archives

From "Tobias Bertelsen (JIRA)" <>
Subject [jira] [Updated] (SPARK-9937) GraphX Performance: Partition overhead scales quadratically
Date Thu, 13 Aug 2015 12:14:46 GMT


Tobias Bertelsen updated SPARK-9937:
    Attachment: Scaleservers-lin.png

> GraphX Performance: Partition overhead scales quadratically
> -----------------------------------------------------------
>                 Key: SPARK-9937
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>            Reporter: Tobias Bertelsen
>         Attachments: Scaleservers-lin.png
> Hello everybody, and GraphX developers in particular.
> I am working on an algorithm that combines normal RDD operations and graph operations. When
I tested how well it parallelizes, I discovered that when I added more worker nodes most stages
ran faster, but my graph operations ran slower.
> More specifically, with twice the number of servers the graph operations take twice
as long, indicating that the total amount of work (servers times runtime) increased fourfold.
I created a plot of the runtime for different numbers of servers, which I have attached.
> The graph operations are called clustering in the plot.
> I tried to look into the code and I think I found something that might be the problem.
> The operations shipVertexAttributes and shipVertexIds in VertexRDDImpl seem to generate
RDDs that contain an element for every combination of vertex partition and edge partition,
even if there is no connection between the two.
> The result is that the overhead time ends up dominating the computation time.
> I am not familiar with the design and code base of GraphX. Perhaps there are more
problems of this kind that hurt parallelization.
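
The scaling argument in the report can be illustrated with a toy model. The sketch below is plain Python, not GraphX code: it only counts how many (vertex partition, edge partition) blocks would exist if one block were produced for every combination, versus how many pairs actually share a vertex. The function names, the ring graph, and the modulo/round-robin partitioning are hypothetical choices made for illustration, and the model assumes that the number of partitions grows in proportion to the number of servers, which the report does not state.

```python
def all_pairs_blocks(num_vertex_partitions, num_edge_partitions):
    """Blocks produced if every (vertex partition, edge partition)
    combination gets one element, whether or not the two are connected."""
    return num_vertex_partitions * num_edge_partitions


def connected_blocks(edges, num_vertex_partitions, num_edge_partitions):
    """Blocks needed if only pairs that share at least one vertex count.

    Assumed (hypothetical) placement: vertex v lives in partition
    v % num_vertex_partitions, and the i-th edge lives in partition
    i % num_edge_partitions.
    """
    pairs = set()
    for i, (src, dst) in enumerate(edges):
        edge_part = i % num_edge_partitions
        for v in (src, dst):
            pairs.add((v % num_vertex_partitions, edge_part))
    return len(pairs)


# A sparse example graph: a ring of 64 vertices, each linked to the next one.
ring_edges = [(v, (v + 1) % 64) for v in range(64)]

for parts in (2, 4, 8, 16):
    print(parts,
          all_pairs_blocks(parts, parts),
          connected_blocks(ring_edges, parts, parts))
```

In this model, doubling the partition count quadruples the all-pairs block count, while the number of connected pairs for the sparse ring only doubles. That matches the shape of the observation above (twice the servers, four times the total work), but it is only a counting argument and says nothing about the actual cost per block in GraphX.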
