mahout-user mailing list archives

From Johannes Schulte
Subject K-Means as a surrogate for Matrix Factorization
Date Fri, 05 Oct 2012 09:44:22 GMT

I have a question concerning a recommendation / classification problem which
I originally wanted to solve with matrix factorization methods from taste /

It has the following properties.

- There are about 200k items
- There are a lot more users (say, millions) and they are very volatile
(like sessions)
- There is no need for the user factor matrix since the recommendation is
very "near-time" dependent. At the time of deployment the user factors need
to be constructed from the items they interacted with in the last seconds,
hence relying on an hourly deployment cycle is not suitable.
- The double user factor arrays for the matrix factorization technique
become very large
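
To make the "user factors from the last few items" point concrete, here is a minimal sketch, assuming item latent vectors already exist and that a session is represented by the plain average of the vectors of the items it touched. The averaging rule, the dot-product score, and the names `session_features`, `score` and `item_features` are illustrative assumptions, not anything from taste / Mahout.

```python
def session_features(item_ids, item_features):
    """Build a session vector as the mean of the known item vectors.

    item_features maps item id -> list of floats (hypothetical layout).
    Returns None when none of the items has a vector yet.
    """
    known = [item_features[i] for i in item_ids if i in item_features]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def score(session_vector, item_vector):
    """Score a candidate item by its dot product with the session vector."""
    return sum(s * t for s, t in zip(session_vector, item_vector))


# Toy data: three items with two latent features each.
item_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
session = session_features(["a", "c"], item_features)
ranking = sorted(item_features, key=lambda i: score(session, item_features[i]),
                 reverse=True)
```

Nothing per-user has to be stored or redeployed here; only the item vectors are kept, and the session vector is rebuilt at request time.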

The question now is:

Given that I'm only interested in item latent features, how does that differ
from a soft k-means clustering over items (with co-occurrence vectors)?
I think the recommendation could then also be expressed as a linear
combination of distances to clusters. Some papers suggest that NMF and
k-means use basically the same loss function, so I hope it's not a totally
stupid idea.
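
As a rough illustration of what I mean by soft cluster memberships as latent features, here is a minimal sketch, assuming squared Euclidean distance and a softmax over negative distances. The `stiffness` parameter, the function names and the toy centroids are assumptions for illustration; this is not Mahout's fuzzy k-means implementation.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_memberships(item_vector, centroids, stiffness=1.0):
    """Turn distances to centroids into weights that sum to 1.

    Larger stiffness pushes the result towards a hard assignment.
    The minimum distance is subtracted before exponentiating so that
    exp() cannot underflow for every cluster at once.
    """
    dists = [squared_distance(item_vector, c) for c in centroids]
    d_min = min(dists)
    weights = [math.exp(-stiffness * (d - d_min)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Toy co-occurrence vectors: two centroids, one item close to the first.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
item = [0.9, 0.1, 0.0]
memberships = soft_memberships(item, centroids)
```

The resulting membership vector has one entry per cluster and could be fed to a downstream model in the same way as a row of an item factor matrix.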

The cluster membership vectors (or latent features) should be used later on
as input to a regression model, that's why a neighbourhood approach doesn't

The main benefits for me would be:

1. Simplicity
2. Performance (I don't need a running cluster for k-means; it works pretty
well on one machine, as opposed to MF)
3. Maybe more freedom to include side information in the clustering
without implementing a new MF technique in Mahout
4. Incremental updates of clusters to model variations over time, maybe
someday with the streaming k-means thing
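
For the incremental-update point, a minimal sketch of the classic sequential (online) k-means rule, assuming each new item vector simply moves its nearest centroid towards it by a step of 1/count. This is only the textbook update, not the streaming k-means work mentioned above, and all names are hypothetical.

```python
def nearest(x, centroids):
    """Index of the centroid with the smallest squared distance to x."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))


def online_update(x, centroids, counts):
    """Assign x to its nearest centroid and shift that centroid in place.

    counts[k] is the number of points centroid k has absorbed so far,
    so the centroid stays the running mean of its assigned points.
    """
    k = nearest(x, centroids)
    counts[k] += 1
    rate = 1.0 / counts[k]
    centroids[k] = [c + rate * (xi - c) for c, xi in zip(centroids[k], x)]
    return k


# Two centroids that have each seen one point so far.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
assigned = online_update([2.0, 0.0], centroids, counts)
```

Replacing the 1/count step with a small constant rate would let clusters keep drifting over time instead of settling down.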

Thanks for your time; I'd appreciate any opinions.

