spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sabarish Sasidharan <>
Subject Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded
Date Tue, 03 Mar 2015 05:48:52 GMT
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing
the similarities computation. So a rowSimiliarities() would be a good fit,
looking forward to it.

In the meanwhile I will try to see if I can further limit the number of
similarities computed through some other fashion or use kmeans instead or a
combination of both. I have also been looking at Mahout's similarity
recommenders based on spark, but not sure if the row similarity would apply
in my case as my matrix is pretty dense.


On Tue, Mar 3, 2015 at 7:11 AM, Pat Ferrel <> wrote:

> Sab, not sure what you require for the similarity metric or your use case
> but you can also look at spark-rowsimilarity or spark-itemsimilarity
> (column-wise) here
> These are optimized for LLR based “similarity” which is very simple to
> calculate since you don’t use either the item weight or the entire row or
> column vector values. Downsampling is done by number of values per column
> (or row) and by LLR strength. This keeps it to O(n)
> They run pretty fast and only use memory if you use the version that
> attaches application IDs to the rows and columns. Using
> SimilarityAnalysis.cooccurrence may help. It’s in the Spark/Scala part of
> Mahout.
> On Mar 2, 2015, at 12:56 PM, Reza Zadeh <> wrote:
> Hi Sab,
> The current method is optimized for having many rows and few columns. In
> your case it is exactly the opposite. We are working on your case, tracked
> by this JIRA:
> Your case is very common, so I will put some time into building it.
> In the meantime, if you're looking for groups of similar points, consider
> using K-means - it will get you clusters of similar rows with euclidean
> distance.
> Best,
> Reza
> On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
>> wrote:
>> ​Hi Reza
>> ​​
>> I see that ((int, int), double) pairs are generated for any combination
>> that meets the criteria controlled by the threshold. But assuming a simple
>> 1x10K matrix that means I would need atleast 12GB memory per executor for
>> the flat map just for these pairs excluding any other overhead. Is that
>> correct? How can we make this scale for even larger n (when m stays small)
>> like 100 x 5 million.​ One is by using higher thresholds. The other is that
>> I use a SparseVector to begin with. Are there any other optimizations I can
>> take advantage of?
>> ​Thanks
>> Sab


Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*

View raw message