mahout-user mailing list archives

From Edith Au <edith...@gmail.com>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Wed, 23 Jul 2014 00:17:09 GMT
I meant to ask what the minimum percentage of non-zero elements in a DRM row
is in order for RowSimilarityJob to generate a similarity vector.  I probably
should have asked for the maximum sparsity.
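
For context on why rows with very few non-zeros are fragile, here is a
minimal sketch. It assumes that SIMILARITY_LOGLIKELIHOOD scores a pair of
rows with the standard log-likelihood ratio (G^2) test on a 2x2 cooccurrence
table. The function names and the example counts are illustrative and are not
taken from Mahout.

```python
import math

def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    # k11: items both rows have     k12: items only the first row has
    # k21: items only the second    k22: items neither row has
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))

# Two users who each like exactly one book out of 600, the same book
shared = log_likelihood_ratio(1, 0, 0, 599)
# Two users who each like exactly one book out of 600, different books
disjoint = log_likelihood_ratio(0, 1, 1, 598)
print(shared, disjoint)
```

With a single non-zero per row, the whole score rests on one cooccurrence
count, so one extra or missing entry changes the result completely.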

What about using SVD for matrix decomposition? Would the SVD job return a
DRM with similarity vectors?  Any good sites/links to start researching SVD
would be greatly appreciated!
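
On the SVD question, a minimal sketch using NumPy rather than Mahout's SVD
job: a decomposition yields factor matrices, not similarity vectors, so row
similarities would be computed afterwards from the reduced row embeddings.
The toy matrix, the rank k=2, and the helper name row_similarities are
assumptions made for illustration only.

```python
import numpy as np

def row_similarities(A, k):
    # Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Rank-k embedding of each row (user) in the reduced space
    emb = U[:, :k] * s[:k]
    # Cosine similarity between row embeddings; guard against zero norms
    norms = np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    unit = emb / norms
    return unit @ unit.T

# Toy user-by-book matrix: users 0 and 1 like only book 0,
# users 2 and 3 have overlapping tastes among books 1-3
A = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

sim = row_similarities(A, k=2)
print(np.round(sim, 3))
```

The rank k is a tuning choice well below the number of columns; it is not
the column count itself.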

Thanks!




On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The minimum sparsity in a DRM is 0 non-zero elements in a row.
>
> That can't be what you were asking, however.  Can you expand on the question?
>
>
> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <edith.au@gmail.com> wrote:
>
> > BTW, what is the min sparsity for a DRM?
> >
> >
> > On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <edith.au@gmail.com> wrote:
> >
> > > You mentioned a matrix decomposition technique.  Should I run the SVD job
> > > instead of RowSimilarityJob?  I found this page that describes the SVD job,
> > > and it seems like that's what I should try.  However, I notice the SVD job
> > > does not need a similarity class as input.  Would the SVD job return a DRM
> > > with similarity vectors?  Also, I am not sure how to determine the
> > > decomposition rank.  In the book example above, would the rank be 600?
> > >
> > >
> > > https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
> > >
> > >
> > > I see your point on using other information (i.e., browsing history) to
> > > "boost" correlation.  This is something I will try after my demo deadline
> > > (or if I cannot find a way to solve the DRM sparsity problem).  BTW, I
> > > took the Solr/Mahout combo approach you described in your book.  It works
> > > very well for the cases where a Mahout similarity vector is present.
> > >
> > > Thanks for your help.  Much appreciated
> > > Edith
> > >
> > >
> > > On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > >> Having such sparse data is going to make it very difficult to do anything
> > >> at all.  For instance, if you have only one non-zero in a row, there is no
> > >> cooccurrence to analyze and that row should be deleted.  With only two
> > >> non-zeros, you have to be very careful about drawing any inferences.
> > >>
> > >> The other aspect of sparsity is that you only have 600 books.  That may
> > >> mean that you would be better served by using a matrix decomposition
> > >> technique.
> > >>
> > >> One question I have is whether you have other actions besides purchase that
> > >> indicate engagement with the books.  Can you record which users browse a
> > >> certain book?  How about whether they have read the reviews?
> > >>
> > >>
> > >>
> > >> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith.au@gmail.com> wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > My RowSimilarityJob returns a DRM with some rows missing.  The input file
> > >> > is very sparse: there are about 600 columns, but only 1 - 6 have a
> > >> > value in each row.  The output file has some rows missing.  The missing
> > >> > rows are ones with only 1 - 2 values filled.  Not all rows with 1 or 2
> > >> > values are missing, just some of them.  And the missing rows are not
> > >> > always the same for each RowSimilarityJob execution.
> > >> >
> > >> > What I would like to achieve is to find the relative strength between
> > >> > rows.  For example, if there are 600 books, and user1 and user2 like only
> > >> > one book (the same book), then there should be a correlation between
> > >> > these 2 users.
> > >> >
> > >> > But my RowSimilarityJob output file seems to skip some of the users with
> > >> > sparse preferences.  I am running the job locally with 4 options: input,
> > >> > output, SIMILARITY_LOGLIKELIHOOD, and temp dir.  What would be the right
> > >> > approach to pick up similarity between users with sparse preferences?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > Edith
> > >> >
> > >>
> > >
> > >
> >
>
