spark-user mailing list archives

From Debasish Das <debasish.da...@gmail.com>
Subject Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded
Date Mon, 02 Mar 2015 07:22:56 GMT
Column-based similarities work well if the number of columns is modest (10K
or 100K; we actually scaled it to 1.5M columns, but that really stress tests
the shuffle and you need to tune the shuffle parameters)...You can either use
DIMSUM sampling or come up with your own threshold, based on your
application, that you apply in reduceByKey (you have to change the code to
use combineByKey and add your filters before shuffling the keys to the
reducers)...
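
A rough sketch in Scala of both options (the matrix contents and the 0.1
threshold are made up; filtering the entries afterwards, as in option 2
below, is simpler than the combineByKey change but does not cut the shuffle
volume the way filtering before the shuffle does):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def thresholdedSimilarities(sc: SparkContext): Unit = {
  // Rows are observations; columns are the entities we want similarities for.
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 2.0),
    Vectors.dense(0.0, 3.0, 4.0),
    Vectors.dense(5.0, 6.0, 0.0)))
  val mat = new RowMatrix(rows)

  // Option 1: DIMSUM sampling -- pairs whose cosine similarity is likely
  // below the threshold are dropped probabilistically before the shuffle.
  val approx = mat.columnSimilarities(0.1)

  // Option 2: exact similarities, filtered afterwards by an application
  // threshold (inside MLlib you would push this filter into combineByKey).
  val filtered = mat.columnSimilarities().entries.filter(_.value >= 0.1)

  filtered.collect().foreach(println)
}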

The other variant you are mentioning is the row-based similarity flow, which
is tracked in the following JIRA, where I am interested in doing no shuffle
but using broadcast and mapPartitions. I will open up the PR soon, but it is
compute intensive and I am experimenting with BLAS optimizations...

https://issues.apache.org/jira/browse/SPARK-4823
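
To make the broadcast + mapPartitions idea concrete, here is a rough,
hypothetical sketch (not the SPARK-4823 code); it assumes the whole row set
fits in executor memory and that rows are L2-normalized so the dot product
equals the cosine similarity:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def rowSimilarities(
    sc: SparkContext,
    rows: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] = {
  // Broadcast one copy of all rows to every executor instead of shuffling.
  val all = sc.broadcast(rows.collect())
  rows.mapPartitions { iter =>
    iter.flatMap { case (i, vi) =>
      all.value.iterator.collect {
        case (j, vj) if j > i =>
          // Plain dot product; this inner loop is what the BLAS
          // experiments (e.g. blocked gemm) would replace.
          var dot = 0.0
          var k = 0
          while (k < vi.length) { dot += vi(k) * vj(k); k += 1 }
          ((i, j), dot)
      }
    }
  }
}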

Your case of 100 x 5 million (or its transpose), for example, is very common
in matrix factorization, where you have user factors and product factors that
will typically be a 5 million x 100 dense matrix, and you want to compute
user->user and item->item similarities...
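
For example (illustrative, assuming an MLlib ALS model), the item factors can
be pulled out and L2-normalized so that the row-similarity sketch above
produces item->item cosine similarities:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def normalizedItemFactors(
    model: MatrixFactorizationModel): RDD[(Long, Array[Double])] = {
  // productFeatures is roughly the 5 million x 100 dense matrix above.
  model.productFeatures.map { case (itemId, factors) =>
    val norm = math.sqrt(factors.map(x => x * x).sum)
    (itemId.toLong, factors.map(_ / norm))  // unit length, so dot = cosine
  }
}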

You are right that sparsity helps, but you can't apply sparsity (for example,
pick the top-K entries) before doing the dot products...so it is still a
compute-intensive operation...
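
If you only need the K nearest neighbours per column, that cut can only be
applied after the similarities are computed, e.g. (K and the grouping
strategy are illustrative; aggregateByKey with a bounded heap would avoid
materializing full neighbour lists):

import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.rdd.RDD

def topKPerColumn(
    sims: CoordinateMatrix,
    k: Int): RDD[(Long, Seq[(Long, Double)])] = {
  sims.entries
    .map(e => (e.i, (e.j, e.value)))   // key by the first column index
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(k))
}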

On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
sabarish.sasidharan@manthan.com> wrote:

> Hi Reza
>
> I see that ((int, int), double) pairs are generated for any combination
> that meets the criteria controlled by the threshold. But assuming a simple
> 1x10K matrix, that means I would need at least 12GB of memory per executor
> for the flat map, just for these pairs, excluding any other overhead. Is
> that correct? How can we make this scale for even larger n (when m stays
> small), like 100 x 5 million? One option is to use higher thresholds. The
> other is that I use a SparseVector to begin with. Are there any other
> optimizations I can take advantage of?
>
> Thanks
> Sab
>
>
