Hi!
I have a question about a recommendation / classification problem which
I originally wanted to solve with the matrix factorization methods from
Taste / Mahout.
It has the following properties:
- There are about 200k items.
- There are many more users (millions, say), and they are very volatile
(session-like).
- There is no need for the user factor matrix, since the recommendation is
very "near-time" dependent: at deployment time the user factors need to be
constructed from the items the user interacted with in the last few
seconds, so relying on an hourly deployment cycle is not suitable.
- The double[] user factor arrays for the matrix factorization technique
become very large.
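To make the "near-time" point concrete, what I have in mind is folding a session user into the item factor space on the fly, roughly like this (just a sketch with made-up sizes and names; `item_factors` stands in for whatever the offline factorization produces):

```python
import numpy as np

# Stand-in for item factors learned offline (num_items x k).
rng = np.random.default_rng(0)
num_items, k = 200_000, 20
item_factors = rng.normal(size=(num_items, k))

def fold_in_user(recent_item_ids, item_factors):
    """Build a user vector on the fly by averaging the factors of the
    items the session interacted with in the last few seconds."""
    return item_factors[recent_item_ids].mean(axis=0)

def recommend(user_vec, item_factors, top_n=10):
    """Score every item against the session vector and take the top N."""
    scores = item_factors @ user_vec
    return np.argsort(scores)[::-1][:top_n]

user_vec = fold_in_user([42, 4711, 31337], item_factors)
top = recommend(user_vec, item_factors)
```

That is, no stored user factor matrix at all, only the item side.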
The question now is: given that I'm only interested in the item latent
features, how does that differ from a soft k-means clustering over the
items (on their co-occurrence vectors)?
I think the recommendation could then be expressed as a linear combination
of distances to the cluster centroids. Some papers suggest that NMF and
k-means optimize essentially the same loss function, so I hope it's not a
totally stupid idea.
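By soft memberships I mean something like the following toy sketch (co-occurrence matrix, cluster count and the softmax-over-distances form are all illustrative assumptions, not a fixed design):

```python
import numpy as np

# Toy stand-in for per-item co-occurrence rows; in reality these would
# be large sparse vectors, one per item.
rng = np.random.default_rng(1)
num_items, dim, n_clusters = 1000, 50, 8
cooccurrence = rng.poisson(1.0, size=(num_items, dim)).astype(float)
centroids = cooccurrence[rng.choice(num_items, n_clusters, replace=False)]

def soft_memberships(x, centroids, beta=1.0):
    """Soft cluster memberships: softmax over negative squared distances,
    so each item gets an n_clusters-dim vector that sums to 1."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

memberships = soft_memberships(cooccurrence, centroids)
```

Each item's membership vector would then play the role of its latent feature vector.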
The cluster membership vectors (or latent features) would later be used as
input to a regression model, which is why a neighbourhood approach doesn't
fit.
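The downstream use would be along these lines (again a sketch; the memberships and the target are placeholders for whatever the clustering and the actual prediction task provide):

```python
import numpy as np

# Hypothetical setup: per-item soft cluster memberships as features
# for a plain linear regression on some per-item target.
rng = np.random.default_rng(2)
num_items, n_clusters = 1000, 8
memberships = rng.dirichlet(np.ones(n_clusters), size=num_items)  # stand-in
target = rng.normal(size=num_items)  # stand-in for the response variable

# Least squares on the membership features, with an intercept column.
X = np.hstack([memberships, np.ones((num_items, 1))])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
pred = X @ coef
```

So the clustering only has to produce usable feature vectors, not neighbourhoods.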
The main benefits for me would be:
1. Simplicity.
2. Performance (I don't need a running cluster for k-means, it works
pretty well on a single machine, as opposed to MF).
3. Possibly more freedom to include side information in the clustering
without implementing a new MF technique in Mahout.
4. Incremental updates of the clusters to model variation over time, maybe
someday with the streaming k-means work.
Thanks for your time, I'd appreciate any opinions.
Johannes
