This doesn't have to be hard to do in real time. With, say, 50
features, 10 million users and items take up 2GB of heap
(assuming floats, not doubles).
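(For what it's worth, the arithmetic behind that figure, sketched in
Python rather than Mahout's Java:)

```python
# Back-of-envelope heap estimate: 50 features each for 10 million
# users and items combined, stored as 4-byte floats (not doubles).
features = 50
entities = 10_000_000          # users + items combined
bytes_per_float = 4            # float, not double
total_bytes = entities * features * bytes_per_float
print(total_bytes)             # 2,000,000,000 bytes, i.e. about 2GB
```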
You can still compute a factored matrix model regularly in the
background, while folding in new data in real time, projecting new
users / data into the space approximately.
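One common way to do that fold-in is to solve a small regularized
least-squares problem against the current item-factor matrix. A
hypothetical numpy sketch, not Mahout's actual code; the function name
and the regularization term `lam` are my assumptions:

```python
import numpy as np

def fold_in_user(Y, item_ids, ratings, lam=0.1):
    """Approximate a new user's factors from the items they just touched.

    Y is the item-factor matrix from the last background factorization;
    item_ids / ratings describe the user's recent interactions.
    """
    Ys = Y[item_ids]                      # factors of the observed items
    r = np.asarray(ratings, dtype=float)
    k = Y.shape[1]
    # Solve (Ys^T Ys + lam*I) x = Ys^T r for the user's factor vector.
    A = Ys.T @ Ys + lam * np.eye(k)
    return np.linalg.solve(A, Ys.T @ r)

rng = np.random.default_rng(0)
Y = rng.normal(size=(200_000, 50)).astype(np.float32)  # ~200k items, 50 features
x = fold_in_user(Y, item_ids=[3, 17, 42], ratings=[1.0, 1.0, 1.0])
scores = Y @ x    # rank all items for this session in one matmul
```

The point is that the fold-in touches only the handful of items the
user interacted with, so it's cheap enough to run per request.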
What you're suggesting sounds like a neighborhood-based approach that
just throws in clustering as a speedup. Yes, that is broadly a valid
approach.
The L2 norm / squared error loss function does show up in NNMF, where
you're minimizing the reconstruction error. You also frequently
minimize Euclidean distances in kmeans, which is also squared error
loss. I don't think that shared loss is, by itself, a reason they're
both applicable here, though I think both are.
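To make the parallel concrete, here's a toy numpy sketch of both
objectives; the matrices are random placeholders, not real data:

```python
import numpy as np

# NMF minimizes the squared reconstruction error ||V - W H||_F^2;
# k-means minimizes the squared distance of each point to its assigned
# centroid. Both are L2 / squared-error losses.

rng = np.random.default_rng(1)
V = rng.random((6, 4))                 # toy nonnegative data matrix

# NMF-style reconstruction error for some nonnegative W, H:
W, H = rng.random((6, 2)), rng.random((2, 4))
nmf_loss = np.sum((V - W @ H) ** 2)

# k-means objective for rows of V against centroids C, assignments z:
C = rng.random((2, 4))
z = np.argmin(((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
kmeans_loss = np.sum((V - C[z]) ** 2)
```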
I am not sure Euclidean distance is going to give reasonable
results in such a high-dimensional and sparse space, and (I think?)
people usually swap in the L1 norm, for example. Others can give more
informed commentary on that. This is part of the reason you'd do
dimensionality reduction to begin with: to make the clustering work.
But if you've already done the NNMF, you already have a basis for
recommendations. I wouldn't rule it out, but to me, adding kmeans to
the mix invites more questions and potential issues on this input than
it may resolve.
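If you do try it, the soft cluster memberships could look something
like this sketch. The softmax-over-negative-distances form and the
`beta` parameter are my assumptions for illustration, not a standard
from any particular paper:

```python
import numpy as np

def soft_memberships(X, C, beta=1.0):
    """Map each row of X to a distribution over the centroids in C."""
    # Squared Euclidean distance from every row of X to every centroid.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-beta * d2)               # closer centroid -> larger weight
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.random((5, 8))        # toy item co-occurrence vectors
C = rng.random((3, 8))        # 3 centroids
M = soft_memberships(X, C)    # 5 x 3 membership features, rows sum to 1
```

Those membership rows would then be the item features fed to the
downstream regression model.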
(This is exactly what Myrrix does, and this scale wouldn't be a big
deal, most especially since you can just partition by user and scale
up to as many servers as you want. I would encourage you to look at
http://myrrix.com, as it sounds like it matches up strongly with your
situation.)
Sean
On Fri, Oct 5, 2012 at 10:44 AM, Johannes Schulte
<johannes.schulte@gmail.com> wrote:
> Hi!
>
> I have a question concerning a recommendation / classification problem that
> I originally wanted to solve with matrix factorization methods from Taste /
> Mahout.
>
> It has the following properties.
>
> - There are about 200k items
> - There are a lot more users (say, millions) and they are very volatile
> (like sessions)
> - There is no need for the user factor matrix, since the recommendation is
> very "near-time" dependent. At the time of deployment, the user factors need
> to be constructed from the items they interacted with in the last seconds,
> hence relying on an hourly deployment cycle is not suitable.
> - The double user factor arrays for the matrix factorization technique
> become very large
>
> The question now is:
>
> Given that I'm only interested in item latent features, how does that
> differ from a soft kmeans clustering over items (with co-occurrence
> vectors)? I think the recommendation could then also be expressed as a
> linear combination of distances to clusters. Some papers suggest that NMF
> and kmeans use basically the same loss function, so I hope it's not a
> totally stupid idea.
>
> The cluster membership vectors (or latent features) should be used later on
> as input to a regression model; that's why a neighbourhood approach doesn't
> fit.
>
> The main benefit for me would be
>
> 1. Simplicity
> 2. Performance (I don't need a running cluster for KMeans; it works pretty
> well on one machine, as opposed to MF)
> 3. Maybe more freedom to include side information into the clustering
> without implementing a new mf technique in mahout
> 4. Incremental updates of clusters to model variations over time, maybe
> someday with the streaming kmeans thing
>
> Thanks for your time, I'd appreciate any opinions.
>
> Johannes
