mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Tue, 22 Jul 2014 20:05:19 GMT
The minimum sparsity in a DRM is 0 non-zero elements in a row.

That can't be what you were asking, however.  Can you expand on the question?

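To make "non-zero elements in a row" concrete, here is a minimal Python
sketch that counts non-zeros per row and sets aside rows with fewer than two,
which is the case described in the earlier reply quoted below (a single
non-zero leaves no cooccurrence to analyze). This is not Mahout code: the
dict-of-dicts layout, the function names, and the sample user IDs are all
assumptions made for illustration.

```python
def nonzero_count(row):
    """Number of non-zero entries in one sparse row (column index -> value)."""
    return sum(1 for value in row.values() if value != 0)


def split_by_sparsity(rows, min_nonzeros=2):
    """Split rows into (kept, dropped) by their non-zero count.

    Hypothetical helper: rows is a dict of row id -> sparse row dict.
    """
    kept, dropped = {}, {}
    for row_id, row in rows.items():
        target = kept if nonzero_count(row) >= min_nonzeros else dropped
        target[row_id] = row
    return kept, dropped


# Made-up sample data: keys are book (column) indices.
rows = {
    "user1": {17: 1.0},                    # one non-zero: no cooccurrence
    "user2": {17: 1.0, 42: 1.0},
    "user3": {},                           # zero non-zeros is still a legal row
    "user4": {3: 1.0, 17: 1.0, 99: 1.0},
}

kept, dropped = split_by_sparsity(rows)
print(sorted(kept))     # ['user2', 'user4']
print(sorted(dropped))  # ['user1', 'user3']
```

Listing the dropped row ids before running a similarity job makes it easier
to tell rows that were filtered on purpose from rows that went missing for
some other reason.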

On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <edith.au@gmail.com> wrote:

> BTW, what is the min sparsity for a DRM?
>
>
> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <edith.au@gmail.com> wrote:
>
> > You mentioned a matrix decomposition technique.  Should I run the SVD job
> > instead of RowSimilarityJob?  I found this page, which describes the SVD
> > job, and it seems like that's what I should try.  However, I notice the
> > SVD job does not need a similarity class as input.  Would the SVD job
> > return a DRM with similarity vectors?  Also, I am not sure how to
> > determine the decomposition rank.  In the book example above, would the
> > rank be 600?
> >
> > https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
> >
> >
> > I see your point on using other information (i.e., browsing history) to
> > "boost" correlation.  This is something I will try after my demo deadline
> > (or if I cannot find a way to solve the DRM sparsity problem).  BTW, I
> > took the Solr/Mahout combo approach you described in your book.  It works
> > very well for the cases where a Mahout similarity vector is present.
> >
> > Thanks for your help.  Much appreciated
> > Edith
> >
> >
> > On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> >> Having such sparse data is going to make it very difficult to do anything
> >> at all.  For instance, if you have only one non-zero in a row, there is no
> >> cooccurrence to analyze and that row should be deleted.  With only two
> >> non-zeros, you have to be very careful about drawing any inferences.
> >>
> >> The other aspect of sparsity is that you only have 600 books.  That may
> >> mean that you would be better served by using a matrix decomposition
> >> technique.
> >>
> >> One question I have is whether you have other actions besides purchase
> >> that indicate engagement with the books.  Can you record which users
> >> browse a certain book?  How about whether they have read the reviews?
> >>
> >>
> >>
> >> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith.au@gmail.com> wrote:
> >>
> >> > Hi
> >> >
> >> > My RowSimilarityJob returns a DRM with some rows missing.  The input
> >> > file is very sparse.  There are about 600 columns, but only 1 - 6 of
> >> > them have a value in each row.  The output file has some rows missing.
> >> > The missing rows are the ones with only 1 - 2 values filled.  Not all
> >> > rows with 1 or 2 values are missing, just some of them.  And the
> >> > missing rows are not always the same for each RowSimilarityJob
> >> > execution.
> >> >
> >> > What I would like to achieve is to find the relative strength between
> >> > rows.  For example, if there are 600 books, and user1 and user2 each
> >> > like only one book (the same book), then there should be a correlation
> >> > between these 2 users.
> >> >
> >> > But my RowSimilarityJob output file seems to skip some of the users with
> >> > sparse preferences.  I am running the job locally with 4 options: input,
> >> > output, SIMILARITY_LOGLIKELIHOOD, and temp dir.  What would be the right
> >> > approach to pick up similarity between users with sparse preferences?
> >> >
> >> > Thanks!
> >> >
> >> > Edith
> >> >
> >>
> >
> >
>

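For readers following the original question (two users who each like only the
same one book out of 600), here is a rough Python sketch of the
log-likelihood ratio score on a 2x2 cooccurrence table, written in the
entropy form. It is an illustration only, not the RowSimilarityJob
implementation: the function names are made up, and the mapping of the four
counts to "books both users like / only one likes / neither likes" is an
assumption about how such a table would be built for a pair of rows.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy: N ln N - sum(k ln k), where N = sum(k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    Assumed meaning of the counts for two users A and B:
      k11: books both like      k12: books only A likes
      k21: books only B likes   k22: books neither likes
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


# The case from the question: 600 books, each user likes the same single book.
shared_single_book = llr(1, 0, 0, 599)

# A table whose rows and columns are independent scores (essentially) zero.
independent = llr(1, 1, 1, 1)

print(shared_single_book > independent)
```

Under these assumptions the shared-single-book pair gets a clearly positive
score, but it rests on a single cooccurrence, which is why the earlier reply
urges caution about drawing inferences from rows with only one or two
non-zeros.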