mahout-user mailing list archives

From Johannes Schulte <johannes.schu...@gmail.com>
Subject Re: K-Means as a surrogate for Matrix Factorization
Date Fri, 05 Oct 2012 11:42:36 GMT
Sean,

thanks for your input.

It's more like 30 million users, plus ID mapping for both items and users,
but I could probably sample that down to something that fits into memory.

My main point was that I'm more interested in the decomposition of users
into a linear combination of factors, as this is needed to construct
feature vectors for a multinomial regression. That's why MF came to my
mind. (It would also have the nice effect that all user feature vectors
would be present after the factorization, ready to be fed into
LogisticRegression.)
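
As a rough sketch of what I mean here (numpy just for illustration;
item_factors and the helper are made-up names, nothing from Mahout):

    import numpy as np

    # Illustrative item factor matrix from some factorization:
    # one k-dimensional factor vector per item.
    n_items, k = 200_000, 50
    rng = np.random.default_rng(0)
    item_factors = rng.normal(size=(n_items, k)).astype(np.float32)

    def user_feature_vector(interacted_item_ids):
        # Express a user as a linear combination of the factors of
        # the items they touched; here simply their mean.
        return item_factors[interacted_item_ids].mean(axis=0)

    x_u = user_feature_vector([3, 17, 42])  # input to the regression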

The recommendation afterwards is driven more by expected profit (per
multinomial target class), where the matrix factorization helps in that it
can almost always recommend something given a restriction to
"CandidateItems", while neighbourhood models (at least pre-computed
item-item similarities) are not suitable for such ad hoc restrictions.
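
Continuing the sketch above, that ad hoc "CandidateItems" restriction is
then just one dot product per candidate (again, made-up names):

    candidate_ids = np.array([5, 99, 1024])     # whatever is allowed now
    scores = item_factors[candidate_ids] @ x_u  # one score per candidate
    best = candidate_ids[np.argmax(scores)]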

So, since I'm using the user-item interactions more like clusters, I was
wondering whether it wouldn't reduce complexity to leave the user out of
the equation. The distance I was planning to use was cosine distance anyway,
but thanks for the warning.
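
Concretely, I was thinking of something like this for the
cluster-membership features (a toy sketch with made-up sizes and centroids):

    import numpy as np

    rng = np.random.default_rng(1)
    n_clusters, dim = 100, 1000                 # toy sizes for the sketch
    centroids = rng.normal(size=(n_clusters, dim)).astype(np.float32)

    def soft_memberships(v, temperature=1.0):
        # Cosine similarity of an item co-occurrence vector against each
        # centroid, softmax-normalised into soft cluster memberships.
        sims = centroids @ v / (np.linalg.norm(centroids, axis=1)
                                * (np.linalg.norm(v) + 1e-12))
        e = np.exp(sims / temperature)
        return e / e.sum()

    features = soft_memberships(rng.normal(size=dim).astype(np.float32))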

I think I'll try both and if something exciting happens I'm gonna tell the
whole world (probably not).

Have a nice day!

Johannes

On Fri, Oct 5, 2012 at 12:40 PM, Sean Owen <srowen@gmail.com> wrote:

> This doesn't have to be hard to do in real time. With, say, 50
> features, 10 million users and items take up 2GB of heap
> (assuming floats, not doubles).
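>
> (Back of the envelope: 10,000,000 vectors x 50 features x 4 bytes per
> float = 2.0e9 bytes, i.e. roughly 2GB.)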
>
> You can still compute a factored matrix model regularly in the
> background, while folding in new data in real time, projecting new
> users / data into the space approximately.
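>
> The fold-in can be as small as a ridge-style least-squares solve
> against the item factors. A minimal numpy sketch (illustrative only,
> not Myrrix or Mahout API):
>
>     import numpy as np
>
>     def fold_in_user(Y, r_u, lam=0.1):
>         # Y: (n_items, k) item factor matrix; r_u: the new user's
>         # (typically very sparse) interaction vector of length n_items.
>         # Solve (Y'Y + lam*I) x = Y'r_u for the user's factor vector.
>         k = Y.shape[1]
>         A = Y.T @ Y + lam * np.eye(k)
>         return np.linalg.solve(A, Y.T @ r_u)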
>
> What you're suggesting sounds like a neighborhood-based approach that
> just throws in clustering as a speed-up. Yes, that is broadly a valid
> approach.
>
> The L2 norm / squared error loss function does show up in NNMF, where
> you're minimizing the reconstruction error. You also frequently
> minimize Euclidean distances in k-means, which is also squared error
> loss. I don't think that's a reason they're both necessarily
> applicable here, though I think both are.
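>
> (Roughly: NNMF minimizes || V - W H ||_F^2 over nonnegative W and H,
> while k-means minimizes sum_i || x_i - mu_{c(i)} ||^2 over assignments
> c and centroids mu; both are squared-error objectives, and with H
> restricted to cluster-indicator form the k-means objective can be read
> as a constrained version of the first.)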
>
> I am not sure the Euclidean distance is going to give reasonable
> results in such a high-dimensional and sparse space, and (I think?)
> people usually swap in the L1 norm, for example. Others can give more
> informed commentary on that. This is part of the reason you'd do
> dimensionality reduction to begin with: to make the clustering work.
>
> But if you've already done the NNMF, you have a basis for
> recommendations already. I wouldn't rule it out, but to me, adding
> k-means to the mix invites more questions and potential issues on this
> input than it may resolve.
>
>
> (This is exactly what Myrrix does, and this scale wouldn't be a big
> deal, especially since you can just partition by user and scale up to
> as many servers as you want. I would encourage you to look at
> http://myrrix.com as it sounds like it matches up strongly with your
> situation.)
>
> Sean
>
> On Fri, Oct 5, 2012 at 10:44 AM, Johannes Schulte
> <johannes.schulte@gmail.com> wrote:
> > Hi!
> >
> > I got a question concerning a recommendation / classification problem
> > which I originally wanted to solve with matrix factorization methods
> > from Taste / Mahout.
> >
> > It has the following properties.
> >
> > - There are about 200k items
> > - There are a lot more users (say, millions) and they are very
> > volatile (like sessions)
> > - There is no need for the user factor matrix, since the
> > recommendation is very "near-time" dependent. At deployment time, the
> > user factors need to be constructed from the items the user interacted
> > with in the last seconds, so relying on an hourly deployment cycle is
> > not suitable.
> > - The double user factor arrays for the matrix factorization
> > technique become very large
> >
> > The question now is:
> >
> > Given that I'm only interested in the item latent features, how does
> > that differ from a soft k-means clustering over items (with
> > co-occurrence vectors)? I think the recommendation could then also be
> > expressed as a linear combination of distances to clusters. Some
> > papers suggest that NMF and k-means use basically the same loss
> > function, so I hope it's not a totally stupid idea.
> >
> > The cluster membership vectors (or latent features) should be used
> > later on as input to a regression model; that's why a neighbourhood
> > approach doesn't fit.
> >
> > The main benefits for me would be:
> >
> > 1. Simplicity
> > 2. Performance (I don't need a running cluster for k-means; it works
> > pretty well on one machine, as opposed to MF)
> > 3. Maybe more freedom to include side information in the clustering
> > without implementing a new MF technique in Mahout
> > 4. Incremental updates of clusters to model variation over time,
> > maybe someday with the streaming k-means thing
> >
> > Thanks for your time, I'd appreciate any opinions.
> >
> > Johannes
>
