mahout-user mailing list archives

From Edith Au <edith...@gmail.com>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Mon, 28 Jul 2014 15:54:17 GMT
Hi Pat,

I am trying to find similar neighborhoods, based on a set of amenities
(Gyms, Cafes, Bookstores, ...).  Using the user/item analogy, I am looking
for similar users.  My plan was to use RowSimilarityJob to pre-calculate
the relative similarity strength between neighborhoods, and then store the
strength vector of each neighborhood in Solr for search queries.
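
Concretely, the pre-calculation step is just the rowsimilarity CLI.  A
minimal sketch of the invocation I have in mind -- the paths and the column
count are placeholders for my data, so double-check against the CLI help:

    # rows = neighborhoods, columns = amenities, LLR similarity
    # --maxSimilaritiesPerRow and --threshold also affect which pairs survive
    mahout rowsimilarity \
      --input /data/neighborhoods/drm \
      --output /data/neighborhoods/similarity \
      --numberOfColumns 600 \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
      --tempDir /tmp/rsj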

A search query would ask for similar neighborhoods with amenities: Gyms,
Cafes, and Bookstores.  Many neighborhoods with these 3 amenities were
skipped by RowSimilarityJob due to too little data.  I could use Solr to
include the skipped neighborhoods in the search results.  But then weighting
might become an issue between the neighborhoods that have similarity vectors
and those that don't.
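
On the Solr side, the query would look something like this (the core and
field names here are made up for illustration, not my actual schema):

    curl 'http://localhost:8983/solr/neighborhoods/select' \
      --data-urlencode 'q=amenity_indicators:(gym cafe bookstore)' \
      --data-urlencode 'fl=id,score'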

For now, I am using Solr to select the related amenities and calculate
UserSimilarity dynamically.  I am going to take a look at ALS-WR next.
Thank you for the suggestion.

Edith


On Fri, Jul 25, 2014 at 1:13 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> Are you trying to find similar users or make recommendations?
>
> If you are using the solr method you perform a solr query to get recs so
> blank rows in the indicator matrix may not be a problem unless there are
> too many. The indicator matrix shows similar items, hence the use of
> itemsimilarity on your input. If there are blank rows there, you will never
> recommend those items. This is always true of new items to the catalog
> because they don’t have interactions/purchases.
>
> If you have blank rows in your user history it may affect the quality of
> the indicator matrix but that’s all. If you have no history for a user, you
> won’t be able to do personalized recs for that user but you will be able to
> for users with some history.
>
> Your instinct about one item in common is probably not correct but we
> can’t really say. Too little data :-) Certainly if two people have ten
> things in common with no differences then we are more confident that they
> have the same taste, right? The algorithm takes the things in common and
> not in common to calculate the “similarity” score. It will do better with
> more non-zero data in your input.
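>
> To put rough numbers on that (made-up counts, just to illustrate): LLR
> compares a 2x2 table per pair of rows -- k11 = items both users have,
> k12 = items only user1 has, k21 = items only user2 has, k22 = everything
> else.  One shared item out of 600 gives k11=1, k12=0, k21=0, k22=599, and
> a single cooccurrence is weak evidence; k11=10, k12=0, k21=0, k22=590
> scores far higher, which is why more non-zeros in the input help.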
>
> Having some all-zero rows in the indicator matrix means that you don’t
> have enough interactions with those items so they may just be new to the
> catalog.
>
> On Jul 25, 2014, at 12:50 PM, Edith Au <edith.au@gmail.com> wrote:
>
> Please correct me if I'm wrong, but I think my problem is related to a
> sparsely populated matrix, not necessarily the size (around 600 features x
> 18K users).  I played with different algos in RowSimilarityJob with different
> matrix sizes to come to this conclusion.  If I fill up the input DRM with
> random features, the output DRM has no skipped rows.
>
> I have tried adding one more indicator (i.e., increasing the feature size to
> around 630).  The result is better: RowSimilarityJob returns more matching
> results.  But there are still many skipped rows.
>
> The business question I am trying to ask is "which users share similar
> features (i.e. like similar items)?"  It did not occur to me that this is
> not a good question to ask for users who do not have a lot of data.  If
> user1 and user2 both like just one item, my instinct says that the
> similarity strength between the two should be high, regardless of the size
> of the universe.
>
> I will take a look at recommendfactorized.  Thanks very much for your
> help.  This mailing list has been great to me. :)
>
>
>
>
>
> On Thu, Jul 24, 2014 at 5:00 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> > I think the default is 1 per user but as Ted said, this will give no
> > cooccurrences—hmm, not sure why the default isn’t 2.
> >
> > A DRM can have all zero rows so there is no min for the data structure,
> > the min you are setting is for the algo, in this case rowsimilarity.
> >
> > The solr approach (sorry I haven’t read this whole thread) uses an
> > indicator matrix that will be 600x600 and is as good as the data you have.
> > Then you use the current user’s history as the solr query on the indexed
> > indicator matrix. If the current user has good history data the recs may
> > be OK.
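> >
> > From memory, that pipeline is roughly the following (paths are
> > placeholders; check the CLI help for the exact flags):
> >
> >     mahout itemsimilarity \
> >       --input /data/purchases \
> >       --output /data/indicators \
> >       --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >       --booleanData true \
> >       --maxSimilaritiesPerItem 50
> >
> > then index each item's indicator row in solr and query it with the user's
> > history.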
> >
> > I’ve done some work looking at using other actions like views + purchases
> > and treating them all the same but for the data we used the quality of
> > recs did not improve very much at all. But data varies...
> >
> > There is another technique--call it cross-cooccurrence--where you use
> > purchases to find important views and then treat them all the same.
> > This may give you much more data to work with but requires that you write
> > a lot more code. We are working on a version of RSJ and itemsimilarity
> > that does this for you but it’s not quite ready yet.
> >
> > I think the other method Ted is talking about is ALS-WR, which is a
> > latent factor method that may help. The CLI is recommendfactorized.
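> >
> > From memory the two steps are roughly the following -- flags may differ
> > by version, so treat this as a sketch and check mahout's help output:
> >
> >     mahout parallelALS \
> >       --input /data/ratings \
> >       --output /data/als \
> >       --numFeatures 20 --numIterations 10 --lambda 0.065
> >
> >     mahout recommendfactorized \
> >       --input /data/als/userRatings \
> >       --userFeatures /data/als/U \
> >       --itemFeatures /data/als/M \
> >       --numRecommendations 10 --maxRating 5 \
> >       --output /data/recs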
> >
> > On Jul 22, 2014, at 5:17 PM, Edith Au <edith.au@gmail.com> wrote:
> >
> > I meant to ask what is the min percentage of non-zero elements in a DRM
> > row in order for RowSimilarityJob to generate a similarity vector.  I
> > probably should have asked for the maximum sparsity.
> >
> > What about using SVD for matrix decomposition? Would the SVD job return a
> > DRM with Similarity vectors?  Any good sites/links to start researching
> > SVD would be greatly appreciated!
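> >
> > For reference, the invocation I would try, guessing at the flag names
> > from the docs (so these may well be wrong):
> >
> >     mahout svd \
> >       --input /data/drm \
> >       --output /data/svd \
> >       --numRows 18000 --numCols 600 --rank 50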
> >
> > Thanks!
> >
> >
> >
> >
> > On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> >> The minimum sparsity in a DRM is 0 non-zero elements in a row.
> >>
> >> That can't be what you were asking, however.  Can you expand the
> >> question?
> >>
> >>
> >> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <edith.au@gmail.com> wrote:
> >>
> >>> BTW, what is the min sparsity for a DRM?
> >>>
> >>>
> >>> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <edith.au@gmail.com> wrote:
> >>>
> >>>> You mentioned a matrix decomposition technique.  Should I run the SVD
> >>>> job instead of RowSimilarityJob?  I found this page describing the SVD
> >>>> job, and it seems like that's what I should try.  However, I notice the
> >>>> SVD job does not need a similarity class as input.  Would the SVD job
> >>>> return a DRM with Similarity vectors?  Also, I am not sure how to
> >>>> determine the decomposition rank.  In the book example above, would the
> >>>> rank be 600?
> >>>>
> >>>> https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
> >>>>
> >>>>
> >>>> I see your point on using other information (i.e. browsing history) to
> >>>> "boost" correlation.  This is something I will try after my demo
> >>>> deadline (or if I cannot find a way to solve the DRM sparsity problem).
> >>>> BTW, I took the Solr/Mahout combo approach you described in your book.
> >>>> It works very well for the cases where a mahout Similarity vector is
> >>>> present.
> >>>>
> >>>> Thanks for your help.  Much appreciated.
> >>>> Edith
> >>>>
> >>>>
> >>>> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <ted.dunning@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Having such sparse data is going to make it very difficult to do
> >>>>> anything at all.  For instance, if you have only one non-zero in a
> >>>>> row, there is no cooccurrence to analyze and that row should be
> >>>>> deleted.  With only two non-zeros, you have to be very careful about
> >>>>> drawing any inferences.
> >>>>>
> >>>>> The other aspect of sparsity is that you only have 600 books.  That
> >>>>> may mean that you would be better served by using a matrix
> >>>>> decomposition technique.
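> >>>>>
> >>>>> (Deleting those one-non-zero rows is a one-liner before the job.
> >>>>> Assuming the input is userID,itemID,pref triples, one per line:
> >>>>>
> >>>>>     awk -F, 'NR==FNR {n[$1]++; next} n[$1] >= 2' in.csv in.csv > out.csv
> >>>>>
> >>>>> The first pass counts non-zeros per user, the second keeps only the
> >>>>> users with at least two.)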
> >>>>>
> >>>>> One question I have is whether you have other actions besides
> >>>>> purchase that indicate engagement with the books.  Can you record
> >>>>> which users browse a certain book?  How about whether they have read
> >>>>> the reviews?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith.au@gmail.com> wrote:
> >>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> My RowSimilarityJob returns a DRM with some rows missing.  The input
> >>>>>> file is very sparse.  There are about 600 columns but only 1 - 6
> >>>>>> would have a value (for each row).  The output file has some rows
> >>>>>> missing.  The missing rows are the ones with only 1 - 2 values
> >>>>>> filled.  Not all rows with 1 or 2 values are missing, just some of
> >>>>>> them.  And the missing rows are not always the same for each
> >>>>>> RowSimilarityJob execution.
> >>>>>>
> >>>>>> What I would like to achieve is to find the relative strength
> >>>>>> between rows.  For example, if there are 600 books and user1 and
> >>>>>> user2 like only one book (the same book), then there should be a
> >>>>>> correlation between these 2 users.
> >>>>>>
> >>>>>> But my RowSimilarityJob output file seems to skip some of the users
> >>>>>> with sparse preferences.  I am running the job locally with 4
> >>>>>> options: input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir.
> >>>>>> What would be the right approach to pick up similarity between users
> >>>>>> with sparse preferences?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Edith
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
> >
>
>
