spark-dev mailing list archives

From zhang juntao <>
Subject spark graphx storage RDD memory leak
Date Sun, 10 Apr 2016 16:08:59 GMT
hi experts,

I'm reporting a problem with Spark GraphX. I submit Spark jobs through Zeppelin;
note that the Scala environment shares the same SparkContext and SQLContext instances.
I call the connected components algorithm for some business logic, and found that
every time a job finishes, some graph storage RDDs are not released.

After several runs, a lot of storage RDDs are still cached even though all the jobs
have finished.
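As a workaround, the leftover RDDs can be listed and released from the driver between jobs. A minimal sketch, assuming a live SparkContext `sc` (such as the one Zeppelin shares); `reportAndClean` is a hypothetical helper name, not part of any Spark API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: list every RDD currently marked as persisted,
// then unpersist them all to reclaim storage memory.
def reportAndClean(sc: SparkContext): Unit = {
  val cached: Map[Int, RDD[_]] = sc.getPersistentRDDs
  cached.foreach { case (id, rdd) =>
    println(s"cached RDD $id: ${rdd.name} (${rdd.getStorageLevel})")
  }
  // Blindly unpersisting everything is only safe between jobs,
  // when no job still depends on these RDDs.
  cached.values.foreach(_.unpersist(blocking = false))
}
```

Running this after each connected-components job releases the graph RDDs that Pregel left cached, at the cost of also evicting anything else the shared context had persisted.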

So I checked the code of connectedComponents and found what may be a problem in Pregel.scala:
when the input graph has already been cached, there is no way to unpersist it from inside Pregel,
so I added the following two unpersist calls (shown in red in the original message) to solve the problem:
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
   (graph: Graph[VD, ED],
    initialMsg: A,
    maxIterations: Int = Int.MaxValue,
    activeDirection: EdgeDirection = EdgeDirection.Either)
   (vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A)
  : Graph[VD, ED] =
{
  var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
  // added: release the input graph once the mapVertices copy is cached
  graph.unpersistVertices(blocking = false)
  graph.edges.unpersist(blocking = false)
  // ... rest of the Pregel loop unchanged ...
} // end of apply

I'm not sure whether this is a bug.
Thank you for your time,
