spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reza Zadeh <>
Subject Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded
Date Mon, 02 Mar 2015 20:56:24 GMT
Hi Sab,
The current method is optimized for having many rows and few columns. In
your case it is exactly the opposite. We are working on your case, tracked
by this JIRA:
Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider
using K-means - it will get you clusters of similar rows with euclidean


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <> wrote:

> ​Hi Reza
> ​​
> I see that ((int, int), double) pairs are generated for any combination
> that meets the criteria controlled by the threshold. But assuming a simple
> 1x10K matrix that means I would need atleast 12GB memory per executor for
> the flat map just for these pairs excluding any other overhead. Is that
> correct? How can we make this scale for even larger n (when m stays small)
> like 100 x 5 million.​ One is by using higher thresholds. The other is that
> I use a SparseVector to begin with. Are there any other optimizations I can
> take advantage of?
> ​Thanks
> Sab

View raw message