spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Larry Xiao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3523) GraphX graph partitioning strategy
Date Tue, 02 Dec 2014 16:49:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
] 

Larry Xiao commented on SPARK-3523:
-----------------------------------

Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface need to be change, because
default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?

> GraphX graph partitioning strategy
> ----------------------------------
>
>                 Key: SPARK-3523
>                 URL: https://issues.apache.org/jira/browse/SPARK-3523
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>    Affects Versions: 1.0.2
>            Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And find the
partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent
to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the workload on
few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus should
exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance (measured using
CV, coefficient of variation) of either vertices or edges of all partitions. The balance of
edges can infer computation balance, and the balance of vertices can infer communication balance.
> * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

>   * This is an example of a balanced partitioning achieving 20% saving on communication.
> * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with PartitionStrategy.scala,
so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message