The fix will be included in Spark 1.0. If you just want it on 0.9.1, here's a hotfixed version of 0.9.1 that includes only PR #367: https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can clone and build that.
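
In the meantime, if rebuilding isn't convenient, one untested idea (not part of the patch) is to coalesce the edge input so that no partition ends up empty, since empty partitions are what trigger the exception. A rough sketch, with placeholder paths and partition counts:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}

    // Untested stopgap sketch: the crash comes from empty partitions, so
    // coalesce the edge input down to a partition count that is sure to
    // be filled. The master, path, and counts below are placeholders.
    val sc = new SparkContext("local[4]", "cc-workaround")

    val edges = sc.textFile("hdfs:///data/edges/part-*").map { line =>
      val fields = line.split("\\s+")
      Edge(fields(0).toLong, fields(1).toLong, 1)
    }

    // Reading ~20 parts worked for you, so a similar partition count
    // should keep every partition non-empty.
    val graph = Graph.fromEdges(edges.coalesce(20), defaultValue = 1)
    val cc = graph.connectedComponents().vertices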

Ankur


On Thu, May 22, 2014 at 4:53 AM, Zhicharevich, Alex <azhicharevich@ebay.com> wrote:

Hi,

I'm running a simple connected components job using GraphX (version 0.9.1).

My input is an HDFS text file split into 400 parts. When I run the job on a single part or a small number of parts (around 20), it runs fine. As soon as I try to read more parts (more than 30), I get an error and the job fails.
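
For reference, the job is essentially of the following shape (a rough sketch; the master URL and HDFS path are placeholders, not the real ones):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader

    // Rough sketch of the job described above; master URL and HDFS path
    // are placeholders.
    val sc = new SparkContext("spark://master:7077", "ConnectedComponents")

    // edgeListFile expects one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges/part-*")

    // Label each vertex with the smallest vertex ID in its component.
    val cc = graph.connectedComponents().vertices
    cc.take(10).foreach(println)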

From looking at the logs, I see the following exception:

    java.util.NoSuchElementException: End of stream
        at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
        at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
        at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)

From searching the web, I see it's a known issue with GraphX, tracked here:

https://github.com/apache/spark/pull/367
https://github.com/apache/spark/pull/497

Are there any stable releases that include this fix? Should I clone the git repo and build it myself? How would you advise me to deal with this issue?

Thanks,

Alex