mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Fri, 25 Jul 2014 20:13:52 GMT
Are you trying to find similar users or make recommendations?

If you are using the Solr method you perform a Solr query to get recs, so blank rows in the
indicator matrix may not be a problem unless there are too many. The indicator matrix shows
similar items, hence the use of itemsimilarity on your input. If there are blank rows there,
you will never recommend those items. This is always true of items new to the catalog because
they don’t have interactions/purchases.

If you have blank rows in your user history it may affect the quality of the indicator matrix
but that’s all. If you have no history for a user, you won’t be able to do personalized
recs for that user, but you will be able to for users with some history.

Your instinct about one item in common is probably not correct but we can’t really say.
Too little data :-) Certainly if two people have ten things in common with no differences
then we are more confident that they have the same taste, right? The algorithm takes the things
in common and not in common to calculate the “similarity” score. It will do better with
more non-zero data in your input.
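To make that concrete, here is a minimal from-scratch Python sketch of the log-likelihood ratio (LLR) score that SIMILARITY_LOGLIKELIHOOD is based on (modeled on the usual G² formulation, not copied from Mahout’s code). The 2x2 counts are exactly the “things in common and not in common”:

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized "entropy" of a list of counts, as used in the G^2 statistic.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 cooccurrence table:
    k11 = users who interacted with both items,
    k12, k21 = users who interacted with only one of the two,
    k22 = users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:  # guard against tiny negatives from rounding
        return 0.0
    return 2.0 * (row + col - mat)
```

For example, ten cooccurrences with no disagreements score much higher than a single cooccurrence in the same universe, which is the “more confident with ten things in common” point above.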

Having some all-zero rows in the indicator matrix means that you don’t have enough interactions
with those items so they may just be new to the catalog.
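As a toy illustration of those all-zero rows (hypothetical data, not Mahout code): an item whose only interactions come from a user who touched nothing else can never cooccur with any other item, so its row in the item-item cooccurrence matrix that feeds the indicator matrix stays empty:

```python
import numpy as np

# Hypothetical interaction matrix: rows = users, columns = items.
# Item 3's only interaction is from user 3, who touched nothing else.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])

# Item-item cooccurrence: C[i, j] = number of users who interacted with both.
C = A.T @ A
np.fill_diagonal(C, 0)  # self-cooccurrence is not informative

# C's row (and column) for item 3 is all zeros, so item 3 can never appear
# as a "similar item" and will never be recommended.
```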

On Jul 25, 2014, at 12:50 PM, Edith Au <> wrote:

Please correct me if I'm wrong, but I think my problem is related to a
sparsely populated matrix, not necessarily the size (around 600 features x 18K
users).  I played with different algos in RowSimilarityJob with different
matrix sizes to come to this conclusion.  If I fill up the input DRM with
random features, the output DRM has no skipped rows.

I have tried adding one more indicator (i.e., increasing the feature size to
around 630).  The result is better.  RowSimilarityJob returns more matching
results.  But there are still many skipped rows.

The business question I am trying to ask is "which users share similar
features (i.e., like similar items)".  It did not occur to me that it is not
a good question to ask for the users who do not have a lot of data.  If
user1 and user2 both like just one item, my instinct said that the
similarity strength between the two should be high, regardless of the size of
the universe.

I will take a look at recommendfactorized.   Thanks very much for your
help.  This mailing list has been great to me. :)

On Thu, Jul 24, 2014 at 5:00 PM, Pat Ferrel <> wrote:

> I think the default is 1 per user but as Ted said, this will give no
> cooccurrences—hmm, not sure why the default isn’t 2.
> A DRM can have all zero rows so there is no min for the data structure,
> the min you are setting is for the algo, in this case rowsimilarity.
> Using the solr approach (sorry I haven’t read this whole thread) uses an
> indicator matrix that will be 600x600 and is as good as the data you have.
> Then you use the current user’s history as the solr query on the indexed
> indicator matrix. If the current user has good history data the recs may be
> OK.
> I’ve done some work looking at using other actions like views + purchases
> and treating them all the same but for the data we used the quality of recs
> did not improve very much at all. But data varies...
> There is another technique--call it cross-cooccurrence--where you use
> purchases to find important views and then treat them all the same.
> This may give you much more data to work with but requires that you write
> a lot more code. We are working on a version of RSJ and itemsimilarity that
> does this for you but it’s not quite ready yet.
> I think the other method Ted is talking about is ALS-WR, which is a latent
> factor method that may help. The CLI is recommendfactorized
> On Jul 22, 2014, at 5:17 PM, Edith Au <> wrote:
> I meant to ask what is the min percentage of non zero elements in a DRM row
> in order for RowSimilarityJob to generate a similarity vector.   I probably
> should have asked for the maximum sparsity.
> What about using SVD for matrix decomposition? Would the SVD job return a
> DRM with Similarity vectors?  Any good sites/links to start researching SVD
> would be greatly appreciated!
> Thanks!
> On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <>
> wrote:
>> The minimum sparsity in a DRM is 0 non-zero elements in a row.
>> That can't be what you were asking, however.  Can you expand the
>> question.
>> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <> wrote:
>>> BTW, what is the min sparsity for a DRM?
>>> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <> wrote:
>>>> You mentioned a matrix decomposition technique.  Should I run the SVD
>>>> job instead of RowSimilarityJob?  I found this page describes the SVD
>>>> job and it seems like that's what I should try.  However, I notice the
>>>> SVD job does not need a similarity class as input.  Would the SVD job
>>>> return a DRM with Similarity vectors?  Also, I am not sure how to
>>>> determine the decomposition rank.  In the book example above, would the
>>>> rank be 600?
>>>> I see your point on using other information (i.e., browsing history) to
>>>> "boost" correlation.   This is something I will try after my demo
>>>> deadline (or if I could not find a way to solve the DRM sparsity
>>>> problem).  BTW, I took the Solr/Mahout combo approach you described in
>>>> your book.  It works very well for the cases where a mahout Similarity
>>>> vector is present.
>>>> Thanks for your help.  Much appreciated
>>>> Edith
>>>> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <> wrote:
>>>>> Having such sparse data is going to make it very difficult to do
>>>>> anything at all.  For instance, if you have only one non-zero in a
>>>>> row, there is no cooccurrence to analyze and that row should be
>>>>> deleted.  With only two non-zeros, you have to be very careful about
>>>>> drawing any inferences.
>>>>> The other aspect of sparsity is that you only have 600 books.  That
>>>>> may mean that you would be better served by using a matrix
>>>>> decomposition technique.
>>>>> One question I have is whether you have other actions besides purchase
>>>>> that indicate engagement with the books.  Can you record which users
>>>>> browse a certain book?  How about whether they have read the reviews?
>>>>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <> wrote:
>>>>>> Hi
>>>>>> My RowSimilarityJob returns a DRM with some rows missing.   The input
>>>>>> file is very sparse.  There are about 600 columns but only 1 - 6
>>>>>> would have a value (for each row).   The output file has some rows
>>>>>> missing.  The missing rows are the ones with only 1 - 2 values
>>>>>> filled.  Not all rows with 1 or 2 values are missing, just some of
>>>>>> them.  And the missing rows are always the same for each
>>>>>> RowSimilarityJob execution.
>>>>>> What I would like to achieve is to find the relative strength between
>>>>>> rows.  For example, if there are 600 books, and user1 and user2 like
>>>>>> only one book (the same book), then there should be a correlation
>>>>>> between these 2 users.
>>>>>> But my RowSimilarityJob output file seems to skip some of the users
>>>>>> with sparse preferences.  I am running the job locally with 4
>>>>>> options: input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir.
>>>>>> What would be the right approach to pick up similarity between users
>>>>>> with sparse preferences?
>>>>>> Thanks!
>>>>>> Edith
