mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Fri, 25 Jul 2014 00:00:30 GMT
I think the default minimum is 1 preference per user, but as Ted said, a single preference gives no cooccurrences; hmm, not sure why the default isn't 2.

A DRM can have all-zero rows, so there is no minimum for the data structure itself; the minimum you are setting is for the algorithm, in this case rowsimilarity.
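As an aside, the reason a single-preference row contributes nothing is easy to see from the cooccurrence product itself. A minimal numpy sketch (made-up toy data, not Mahout code):

```python
import numpy as np

# User-item preference matrix: 4 users x 3 books (hypothetical data).
# user0 bought only book0; user1 and user2 each bought books 0 and 1.
A = np.array([
    [1, 0, 0],   # a single preference...
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Item-item cooccurrence is A^T A; an off-diagonal entry counts how often
# two books were bought by the same user.
cooccurrence = A.T @ A
print(cooccurrence)

# A row with one non-zero only bumps the diagonal (a book "co-occurs"
# with itself), so it adds no evidence linking any pair of books.
single = np.array([[1, 0, 0]])
print(single.T @ single)   # all off-diagonal entries are zero
```

That is why rows with fewer than 2 non-zeros are effectively dead weight for cooccurrence analysis.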

The Solr approach (sorry, I haven't read this whole thread) uses an indicator matrix that will be 600x600 and is only as good as the data you have. Then you use the current user's history as the Solr query against the indexed indicator matrix. If the current user has good history data, the recs may be OK.
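For concreteness, here is a sketch of that lookup with Solr stubbed out by a plain token-overlap score; the book IDs and the tiny indicator "index" are made up for illustration:

```python
# Indicator "index": for each book, the books the similarity test kept as
# significant cooccurrences (in practice, one Solr document per book).
indicator_index = {
    "book_a": {"book_b", "book_c"},
    "book_b": {"book_a"},
    "book_c": {"book_a", "book_d"},
    "book_d": {"book_c"},
}

def recommend(user_history, k=2):
    """Score each unseen book by how many of its indicators the user
    touched (a stand-in for Solr's OR-query relevance score)."""
    scores = {
        book: len(indicators & set(user_history))
        for book, indicators in indicator_index.items()
        if book not in user_history
    }
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [book for book, score in ranked[:k] if score > 0]

print(recommend(["book_a"]))   # books whose indicator lists mention book_a
```

Solr does the scoring with real text relevance (TF-IDF etc.), but the shape of the query is the same: the user's history is the query, the indicator rows are the documents.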

I’ve done some work on using other actions, like views + purchases, and treating them all the same, but for the data we used the quality of recs did not improve much at all. But data varies...

There is another technique--call it cross-cooccurrence--where you use purchases to find important views and then treat them all the same.
This may give you much more data to work with but requires writing a lot more code. We are working on a version of RSJ and itemsimilarity that does this for you, but it’s not quite ready yet.
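The core of cross-cooccurrence is comparing one action matrix against another instead of against itself. A rough numpy sketch with made-up data:

```python
import numpy as np

# Hypothetical data: 4 users x 3 books, two kinds of actions.
purchases = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])
views = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# Plain cooccurrence compares purchases with purchases (A^T A).
# Cross-cooccurrence compares views with purchases (B^T A): a view of
# book i becomes evidence for a purchase of book j whenever the same
# users did both.  Views are far more plentiful, hence more data.
cross = views.T @ purchases
print(cross)
```

In practice you would still filter these counts with a significance test (e.g. LLR) before using them as indicators, just as RSJ does for plain cooccurrence.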

I think the other method Ted is talking about is ALS-WR, which is a latent factor method that may help. The CLI driver is recommendfactorized.
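In case it helps, here is a rough numpy sketch of what ALS-WR does (alternating least squares with weighted-lambda regularization) on toy data; this is an illustration, not Mahout's parallel implementation, and all the numbers are made up:

```python
import numpy as np

# Toy 4x3 ratings matrix; 0 means "unobserved".
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])
mask = R > 0
k, lam = 2, 0.1                      # latent rank and regularization

rng = np.random.default_rng(0)
U = np.zeros((R.shape[0], k))        # user factors (filled in first sweep)
V = rng.standard_normal((R.shape[1], k)) * 0.1   # item factors

for _ in range(20):
    # Fix V, solve a small ridge regression per user over observed items.
    for u in range(R.shape[0]):
        obs = mask[u]
        Vo, n = V[obs], obs.sum()
        # The "-WR" part: regularization weighted by the observation count.
        U[u] = np.linalg.solve(Vo.T @ Vo + lam * n * np.eye(k),
                               Vo.T @ R[u, obs])
    # Fix U, solve per item symmetrically.
    for i in range(R.shape[1]):
        obs = mask[:, i]
        Uo, n = U[obs], obs.sum()
        V[i] = np.linalg.solve(Uo.T @ Uo + lam * n * np.eye(k),
                               Uo.T @ R[obs, i])

pred = U @ V.T                       # dense scores; rank the unobserved cells
print(np.round(pred, 2))
```

Because the factors are low rank, every user gets a score for every item, which is exactly why a latent factor method can cope with rows that have only one or two preferences.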

On Jul 22, 2014, at 5:17 PM, Edith Au <> wrote:

I meant to ask: what is the minimum percentage of non-zero elements in a DRM row needed for RowSimilarityJob to generate a similarity vector? I probably should have asked for the maximum sparsity.

What about using SVD for matrix decomposition? Would the SVD job return a DRM with similarity vectors? Any good sites/links to start researching SVD would be greatly appreciated!


On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <> wrote:

> The minimum sparsity in a DRM is 0 non-zero elements in a row.
> That can't be what you were asking, however.  Can you expand the question?
> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <> wrote:
>> BTW, what is the min sparsity for a DRM?
>> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <> wrote:
>>> You mentioned a matrix decomposition technique.  Should I run the SVD job
>>> instead of RowSimilarityJob?  I found this page describes the SVD job and
>>> it seems like that's what I should try.  However, I notice the SVD job does
>>> not need a similarity class as input.  Would the SVD job return a DRM with
>>> similarity vectors?  Also, I am not sure how to determine the decomposition
>>> rank.  In the book example above, would the rank be 600?
>>> I see your point on using other information (i.e. browsing history) to
>>> "boost" correlation.  This is something I will try after my demo deadline
>>> (or if I cannot find a way to solve the DRM sparsity problem).  BTW, I
>>> took the Solr/Mahout combo approach you described in your book.  It works
>>> very well for the cases where a Mahout similarity vector is present.
>>> Thanks for your help.  Much appreciated,
>>> Edith
>>> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <> wrote:
>>>> Having such sparse data is going to make it very difficult to do anything
>>>> at all.  For instance, if you have only one non-zero in a row, there is no
>>>> cooccurrence to analyze and that row should be deleted.  With only two
>>>> non-zeros, you have to be very careful about drawing any inferences.
>>>> The other aspect of sparsity is that you only have 600 books.  That may
>>>> mean that you would be better served by using a matrix decomposition
>>>> technique.
>>>> One question I have is whether you have other actions besides purchase
>>>> that indicate engagement with the books.  Can you record which users
>>>> browse a certain book?  How about whether they have read the reviews?
>>>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <> wrote:
>>>>> Hi
>>>>> My RowSimilarityJob returns a DRM with some rows missing.  The input
>>>>> file is very sparse.  There are about 600 columns but only 1 - 6 would
>>>>> have a value (for each row).  The output file has some rows missing.
>>>>> The missing rows are the ones with only 1 - 2 values filled.  Not all
>>>>> rows with 1 or 2 values are missing, just some of them.  And the missing
>>>>> rows are not always the same for each RowSimilarityJob execution.
>>>>> What I would like to achieve is to find the relative strength between
>>>>> rows.  For example, if there are 600 books and user1 and user2 like
>>>>> only one book (the same book), then there should be a correlation
>>>>> between these 2 users.
>>>>> But my RowSimilarityJob output file seems to skip some of the users
>>>>> with sparse preferences.  I am running the job locally with 4 options:
>>>>> input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir.  What would be
>>>>> the right approach to pick up similarity between users with sparse
>>>>> preferences?
>>>>> Thanks!
>>>>> Edith
