spark-user mailing list archives

From Jeffrey Picard <jp3...@columbia.edu>
Subject Re: GraphX Connected Components
Date Wed, 30 Jul 2014 20:13:02 GMT

On Jul 30, 2014, at 5:18 AM, Ankur Dave <ankurdave@gmail.com> wrote:

> Jeffrey Picard <jp3436@columbia.edu> writes:
>> As the program runs I’m seeing each iteration take longer and longer to complete.
>> This seems counterintuitive to me, especially since I am seeing the shuffle
>> read/write amounts decrease with each iteration. I would think that as more and
>> more vertices converged, the iterations should take a shorter amount of time. I can
>> run on up to 150 of the 500 part files (stored on S3) and it finishes in about 12
>> minutes, but with all the data I’ve let it run up to 4 hours and it still doesn’t
>> complete.
> 
> If GraphX is running close to the cluster's memory capacity, one possibility is that
> Spark is dropping part of the graph from memory and causing recomputation. The Spark web UI
> will show if this is the case: the Executors page will show executors close to their memory
> limit, and the Storage page will show many RDDs with less than 100% cached blocks.
> 
> In that case you could allow Spark to spill partitions to disk by changing the graph's
> storage level to MEMORY_AND_DISK or DISK_ONLY when you load the graph.
> 
> Ankur

Thanks Ankur. My problem does sound like what you described, so I think that’s probably it.

It seems that the version of GraphX I’m using doesn’t have an option for setting the storage
level in the GraphLoader.edgeListFile method:
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$

I tried unpersisting the edges and vertices of the graph by hand, then persisting the graph
with persist(StorageLevel.MEMORY_AND_DISK). However, I still see the same behavior in connected
components, and the Storage page shows the same thing you described.
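
A minimal sketch of that manual approach, assuming the Spark 1.0.x GraphX API. The helper name and the `path` argument are placeholders for illustration, not taken from the actual job:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: `path` stands in for the real input location.
def loadWithSpill(sc: SparkContext, path: String): Graph[Int, Int] = {
  // The 1.0.x loader takes no storage level argument, so the graph's RDDs
  // may already carry a storage level by the time it returns.
  val graph = GraphLoader.edgeListFile(sc, path)

  // Spark refuses to change the level of an RDD that already has one,
  // so clear the existing level on both halves of the graph first ...
  graph.vertices.unpersist(blocking = true)
  graph.edges.unpersist(blocking = true)

  // ... then request a level that spills partitions to disk instead of
  // dropping them when memory runs out.
  graph.persist(StorageLevel.MEMORY_AND_DISK)
}

// Usage, with sc and path supplied by the caller:
//   val cc = loadWithSpill(sc, path).connectedComponents()
```

This only covers the input graph. It makes no claim about the storage level of any intermediate graphs that connectedComponents creates while it iterates.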

Storage

RDD Name   Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
VertexRDD  Memory Deserialized 1x Replicated  278                56%              50.6 GB         0.0 B            0.0 B
VertexRDD  Disk Serialized 1x Replicated      498                100%             0.0 B           0.0 B            32.4 GB
VertexRDD  Memory Deserialized 1x Replicated  435                87%              79.2 GB         0.0 B            0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  492                98%              273.5 GB        0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  395                79%              71.5 GB         0.0 B            0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  263                53%              146.2 GB        0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  400                80%              72.8 GB         0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  179                36%              32.4 GB         0.0 B            0.0 B
EdgeRDD    Disk Serialized 1x Replicated      500                100%             0.0 B           0.0 B            96.0 GB

Would the (newer?) version of GraphX that lets you set the storage level in edgeListFile
possibly solve this, or could there still be something else going on?
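
For reference, a sketch of the kind of call in question. It assumes a GraphX release in which GraphLoader.edgeListFile accepts `edgeStorageLevel` and `vertexStorageLevel` parameters (believed to be Spark 1.1.0 and later; they are not in the 1.0.1 API linked above). The helper name and `path` are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: `path` stands in for the real input location.
def loadSpillable(sc: SparkContext, path: String): Graph[Int, Int] =
  GraphLoader.edgeListFile(
    sc,
    path,
    // Assumed parameters: only present in newer GraphX releases.
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
```

Whether this is enough depends on whether the graphs built inside the connected components iterations also use these levels, which this sketch does not establish.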