spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pat Ferrel <>
Subject Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded
Date Tue, 03 Mar 2015 01:41:03 GMT
Sab, not sure what you require for the similarity metric or your use case but you can also
look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here
<>.  These are
optimized for LLR based “similarity” which is very simple to calculate since you don’t
use either the item weight or the entire row or column vector values. Downsampling is done
by number of values per column (or row) and by LLR strength. This keeps it to O(n)

They run pretty fast and only use memory if you use the version that attaches application
IDs to the rows and columns. Using SimilarityAnalysis.cooccurrence may help. It’s in the
Spark/Scala part of Mahout.

On Mar 2, 2015, at 12:56 PM, Reza Zadeh <> wrote:

Hi Sab,
The current method is optimized for having many rows and few columns. In your case it is exactly
the opposite. We are working on your case, tracked by this JIRA:
Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider using K-means -
it will get you clusters of similar rows with euclidean distance.


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan < <>>

​Hi Reza
I see that ((int, int), double) pairs are generated for any combination that meets the criteria
controlled by the threshold. But assuming a simple 1x10K matrix that means I would need atleast
12GB memory per executor for the flat map just for these pairs excluding any other overhead.
Is that correct? How can we make this scale for even larger n (when m stays small) like 100
x 5 million.​ One is by using higher thresholds. The other is that I use a SparseVector
to begin with. Are there any other optimizations I can take advantage of?


View raw message