spark-user mailing list archives

From Hlib Mykhailenko <>
Subject [GRAPHX] could not process graph with 230M edges
Date Fri, 13 Mar 2015 15:09:28 GMT

I cannot process a graph with 230M edges.
I cloned apache/spark, built it, and then ran it on a cluster.

I used a Spark standalone cluster:
- 5 machines (each with 12 cores / 32GB RAM)
- 'spark.executor.memory' == 25g
- 'spark.driver.memory' == 3g
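For reference, a launch command matching these settings might look like the sketch below. The master URL, jar name, and arguments are placeholders, not taken from the original post:

```shell
# Hypothetical spark-submit invocation with the memory settings above.
# Arguments: <path_to_graph> <partitioner_name> <minEdgePartitions>
spark-submit \
  --class Canonical \
  --master spark://master-host:7077 \
  --conf spark.executor.memory=25g \
  --conf spark.driver.memory=3g \
  canonical.jar hdfs:///path/to/graph EdgePartition2D 60
```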

The graph has 231,359,027 edges, and its file weighs 4,524,716,369 bytes.
The graph is represented in text format:
<source vertex id> <destination vertex id>
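This is the whitespace-separated layout that GraphLoader.edgeListFile expects. As a minimal illustration of the format (EdgeLineParser is a hypothetical helper, not part of GraphX; the loader does this parsing internally):

```scala
// Parse one line of the "<source vertex id> <destination vertex id>" format.
object EdgeLineParser {
  def parse(line: String): (Long, Long) = {
    val parts = line.trim.split("\\s+")
    (parts(0).toLong, parts(1).toLong)
  }
}
```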

My code: 

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

object Canonical {

  def main(args: Array[String]) {

    val numberOfArguments = 3
    require(args.length == numberOfArguments,
      s"""Wrong argument number. Should be $numberOfArguments
         |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>""".stripMargin)

    var graph: Graph[Int, Int] = null
    val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
    val partitionerName = args(1)
    val minEdgePartitions = args(2).toInt

    val sc = new SparkContext(new SparkConf()
      .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts "))

    graph = GraphLoader.edgeListFile(sc, args(0), false,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK,
      minEdgePartitions = minEdgePartitions)

    graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))
  }
}

After running it I encountered a number of java.lang.OutOfMemoryError: Java heap space errors,
and of course I did not get a result.

Is the problem in my code? Or in the cluster configuration?

It works fine for relatively small graphs, but for this graph it never worked. (And
I do not think that 230M edges is too much data.)
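A rough back-of-envelope check supports that intuition. Assuming roughly 20 bytes per edge in GraphX's packed edge arrays (two 8-byte vertex ids plus a 4-byte Int attribute), the raw edge data alone would be only a few GB; this sketch deliberately ignores JVM object overhead, vertex structures, and the extra copies made during partitionBy, which can multiply the real footprint:

```scala
// Very rough lower-bound estimate of in-memory edge data size.
// bytesPerEdge = 20 is an assumption: 2 x Long vertex id + 1 x Int attribute.
object EdgeMemoryEstimate {
  def estimatedBytes(numEdges: Long, bytesPerEdge: Long = 20L): Long =
    numEdges * bytesPerEdge
}
```

For the 231,359,027-edge graph this gives about 4.6 GB, far under the 5 x 25g of executor memory, which suggests the blow-up happens in the shuffle or in intermediate copies rather than in the raw edge storage.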

Thank you for any advice!

Hlib Mykhailenko 
Doctoral student at INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93 
