spark-user mailing list archives

From "Zhicharevich, Alex" <>
Subject RE: GraphX partition problem
Date Sun, 25 May 2014 13:38:54 GMT
Thanks Ankur,

Built it from git and it works great.

I have another issue now. I am trying to process a huge graph with about 20 billion edges
with GraphX. I only load the file, compute connected components and persist it right back
to disk. When working with subgraphs (~50M edges) this works well, but on the whole graph
it seems to choke during graph construction.
Can you advise on how to tune Spark's memory parameters for this task?
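For reference, a minimal sketch of the pipeline described above: load an edge list, run connected components, and persist the result. The master URL and HDFS paths are placeholders, and the partition count and partitioning strategy are illustrative tuning knobs, not values from the original message. This assumes the GraphX 0.9+ API, where `GraphLoader.edgeListFile` accepts a `minEdgePartitions` argument.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object ConnectedComponentsJob {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and application name.
    val sc = new SparkContext("spark://master:7077", "connected-components")

    // Loading with more edge partitions spreads the construction work out;
    // a 2D edge partitioning can reduce per-partition memory pressure on
    // very large graphs. Both values here are illustrative.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///path/to/edges", minEdgePartitions = 400)
      .partitionBy(PartitionStrategy.EdgePartition2D)

    // Compute connected components and write the (vertexId, componentId)
    // pairs back to HDFS.
    val cc = graph.connectedComponents()
    cc.vertices.saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}
```

Executor memory itself is set outside the job, e.g. via `spark.executor.memory` in the application configuration.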


From: Ankur Dave []
Sent: Thursday, May 22, 2014 6:59 PM
Subject: Re: GraphX partition problem

The fix will be included in Spark 1.0, but if you just want to apply the fix to 0.9.1, here's
a hotfixed version of 0.9.1 that only includes PR #367:
You can clone and build this.
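As a sketch of the clone-and-build step (the hotfix repository URL from the original message is not preserved in this archive, so it appears below as a placeholder; the `sbt/sbt assembly` command is the standard build for the Spark 0.9.x line):

```shell
# Substitute the repository URL from Ankur's message for the placeholder.
git clone <hotfix-repo-url> spark-0.9.1-hotfix
cd spark-0.9.1-hotfix

# Spark 0.9.x builds with the bundled sbt launcher.
sbt/sbt assembly
```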


On Thu, May 22, 2014 at 4:53 AM, Zhicharevich, Alex <> wrote:

I’m running a simple connected components code using GraphX (version 0.9.1)

My input comes from an HDFS text file partitioned into 400 parts. When I run the code on a single
part or a small number of files (like 20) the code runs fine. As soon as I try to read
more files (more than 30), I get an error and the job fails.
From looking at the logs I see the following exception:

    java.util.NoSuchElementException: End of stream
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
        at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)

From searching the web, I see it's a known issue with GraphX.
Here:
And here:

Are there any stable releases that include this fix? Should I clone the git repo and build
it myself? How would you advise me to deal with this issue?

