spark-user mailing list archives

From: Jeffrey Picard <jp3...@columbia.edu>
Subject: GraphX Connected Components
Date: Tue, 29 Jul 2014 17:27:34 GMT

Hey all,

I’m currently trying to run connected components using GraphX on a large graph (~1.8b vertices
and ~3b edges; most of them are self edges, where the only edge that exists for vertex v is
v->v) on EMR using 50 m3.xlarge nodes. As the program runs, I’m seeing each iteration
take longer and longer to complete. This seems counterintuitive to me, especially since I
am seeing the shuffle read/write amounts decrease with each iteration. I would think that
as more and more vertices converge, the iterations should take less time. I
can run on up to 150 of the 500 part files (stored on S3) and it finishes in about 12 minutes,
but with all the data I’ve let it run for up to 4 hours and it still doesn’t complete. Does
anyone have ideas for approaches to troubleshooting this, Spark parameters that might need
to be tuned, etc.?
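
The expectation above (less work per iteration as vertices converge) can be illustrated with a toy sketch of minimum-label propagation. This is single-machine Python, not the GraphX job described here; the function name `connected_components` and the sample graph are made up for illustration, and the sketch says nothing about how the real job spends its time.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Toy min-label propagation.

    Returns (labels, active_counts), where active_counts[i] is the number
    of vertices that still had work to do in round i.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # a self edge v->v can never change v's label
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    labels = {v: v for v in vertices}  # every vertex starts as its own component
    active = set(vertices)             # vertices whose label changed last round
    active_counts = []

    while active:
        active_counts.append(len(active))
        # Each active vertex offers its current label to its neighbors;
        # a neighbor keeps the smallest label it is offered.
        proposals = {}
        for v in active:
            for n in neighbors[v]:
                if labels[v] < proposals.get(n, labels[n]):
                    proposals[n] = labels[v]
        # Apply updates after the round; only updated vertices stay active.
        for n, label in proposals.items():
            labels[n] = label
        active = set(proposals)

    return labels, active_counts
```

On a small hypothetical graph with two self-edge-only vertices and one chain, `connected_components(range(1, 7), [(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)])` gives active counts of `[6, 3, 2, 1]`: the self-edge-only vertices drop out after the first round, which is the shrinking workload one would expect to see reflected in iteration time.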

Best Regards,

Jeffrey Picard
