spark-user mailing list archives

From Ankur Dave
Subject Re: GraphX partition problem
Date Sun, 25 May 2014 21:22:50 GMT
Once the graph is built, edges are stored in parallel primitive arrays, so
each edge should only take 20 bytes to store (srcId: Long, dstId: Long,
attr: Int). Unfortunately, the current implementation in
EdgePartitionBuilder uses an array of Edge objects as an intermediate
representation for sorting, so each edge will additionally take something
like 40 bytes during graph construction (srcId (8) + dstId (8) + attr (4) +
uncompressed pointer (8) + object overhead (8) + padding (4)). So you'll
need about 1.2 TB of memory in total (60 bytes * 20 billion). Does your
cluster have this much memory?
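
For reference, here's a simplified sketch of the two layouts (class and
field names are illustrative, not the exact GraphX internals; per-object
costs assume a 64-bit JVM without compressed oops):

// Final storage: three parallel primitive arrays, 20 bytes per edge.
class CompactEdges(numEdges: Int) {
  val srcIds = new Array[Long](numEdges) // 8 bytes/edge
  val dstIds = new Array[Long](numEdges) // 8 bytes/edge
  val attrs  = new Array[Int](numEdges)  // 4 bytes/edge
}

// Intermediate sorting representation: one object per edge, roughly
// 8 (header) + 8 (srcId) + 8 (dstId) + 4 (attr) + 4 (padding) = 32 bytes,
// plus an 8-byte uncompressed pointer in the containing array, so about
// 40 extra bytes/edge. At 20 billion edges, (20 + 40) bytes each is ~1.2 TB.
case class EdgeObj(srcId: Long, dstId: Long, attr: Int)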

If not, I've been meaning to write a custom sort routine that can operate
directly on the three parallel arrays. That would eliminate the intermediate
Edge objects and reduce the memory requirement to about 400 GB (20 bytes *
20 billion).
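
Roughly, the routine I have in mind would look like this (an untested
sketch, not the eventual implementation): an in-place quicksort keyed on
(srcId, dstId) whose swaps move all three columns together, so no per-edge
objects are ever allocated.

object ParallelArraySort {
  def sort(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int]): Unit =
    quickSort(srcIds, dstIds, attrs, 0, srcIds.length - 1)

  // True if edge i orders before edge j under the (srcId, dstId) key.
  private def lt(src: Array[Long], dst: Array[Long], i: Int, j: Int): Boolean =
    src(i) < src(j) || (src(i) == src(j) && dst(i) < dst(j))

  // Swap edge i and edge j across all three parallel arrays.
  private def swap(src: Array[Long], dst: Array[Long], attr: Array[Int],
                   i: Int, j: Int): Unit = {
    val ts = src(i); src(i) = src(j); src(j) = ts
    val td = dst(i); dst(i) = dst(j); dst(j) = td
    val ta = attr(i); attr(i) = attr(j); attr(j) = ta
  }

  private def quickSort(src: Array[Long], dst: Array[Long], attr: Array[Int],
                        lo: Int, hi: Int): Unit = {
    if (lo < hi) {
      // Lomuto partition around the last element as the pivot.
      var p = lo
      var i = lo
      while (i < hi) {
        if (lt(src, dst, i, hi)) { swap(src, dst, attr, i, p); p += 1 }
        i += 1
      }
      swap(src, dst, attr, p, hi)
      quickSort(src, dst, attr, lo, p - 1)
      quickSort(src, dst, attr, p + 1, hi)
    }
  }
}

A production version would want smarter pivot selection (or a fallback to
heapsort) to bound the recursion depth, but the key point is the same: the
comparison reads only the key arrays, and every swap moves all three columns
at once.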

Ankur
