Johannes,
Funny you should mention matrix factorization and k-means at the same
moment. I am talking this afternoon in Oxford about just this topic.
Yes, you can use the proximity to near clusters as a useful modeling
feature, but as Sean said, the cost of matrix factorization should not be
the motivating factor for that. There is also a lot of runtime on matrix
factor recommenders and little on radial basis function ones.
I have been noodling for a bit that reducing the dimension through random
projection and then doing k-means in that space and the associated cluster
proximities would make a good basis space for almost any modeling, but
reversing the mapping through the stochastic projection requires that you
solve a compressive sensing sort of problem which involves L_1 regularized
solution of an underdetermined linear system. This will be almost exactly
the same amount of work as doing a stochastic projection SVD and thus quite
similar to the ALS sorts of solvers as well.
Clustering can also be quite expensive. If you use the new k-means stuff
that isn't quite in Mahout yet, you can get the cost down quite low, but
that is very new code. It does give two or three orders of magnitude
speedup, but it has zero production runtime.
I would thus only expect a win if you use the new clustering and don't need
explanations and can arrange recommendations to avoid solving the reverse
problem.
On Fri, Oct 5, 2012 at 12:42 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:
> Sean,
>
> thanks for your input.
>
> It's more like 30 million users + id mapping for both items and users, but
> i could probably sample that to something that fits into memory
>
> My main point was that I'm more interested in the decomposition of users
> into a linear combination of factors, as this is needed to construct
> feature vectors for a multinominal regression. Thats why MF came to my
> mind. (It would also have the nice effect that all user*feature vectors
> would be present after the factorization, ready to be fed into
> LogisticRegression)
>
> The recommendation afterwards is more driven by expected profit (per
> multinominal target class), where the matrix factorization helps in the way
> that it can almost always recommend something given a restriction of
> "CandidateItems", while neighbourhood models (at least pre computed
> item-item similarities) are not suitable for this ad hoc restrictions.
>
> So since I'm using the user-item interaction more like clusters i was
> wondering if it wouldn't reduce complexity to leave the user out of the
> equation. The distance I was planning to use was cosine distance anyway,
> but thanks for the warning.
>
> I think I'll try both and if something exciting happens I'm gonna tell the
> whole world (probably not).
>
> Have a nice day!
>
> Johannes
>
> On Fri, Oct 5, 2012 at 12:40 PM, Sean Owen wrote:
>
> > This doesn't have to be hard to do in real-time. With, say, 50
> > features, then 10 million users and items take up 2GB of heap
> > (assuming floats not doubles).
> >
> > You can still compute a factored matrix model regularly in the
> > background, while folding in new data in real time, projecting new
> > users / data into the space approximately.
> >
> > What you're suggesting sounds like a neighborhood-based approach, that
> > just throws in clustering as a speed-up. Yes that is broadly a valid
> > approach.
> >
> > The L2 norm / squared error loss function does show up in NNMF, where
> > you're minimizing the reconstruction error. You also frequently
> > minimize Euclidean distances in k-means, which is also squared error
> > loss. I don't think that's a reason they're both necessarily
> > applicable here, though I think both are.
> >
> > I am not sure the Euclidean distance is going to give reasonable
> > results in such high dimensional and sparse space and (I think?)
> > people usually swap in the L1 norm for example. Others can give more
> > informed commentary on that. This is part of the reason you'd do
> > dimensionality reduction to begin with, to make the clustering work.
> >
> > But if you're doing that you have a basis for recommendations already,
> > if you've already done the NNMF. I wouldn't rule it out. To me I think
> > adding k-means in the mix invites more questions and potential issues
> > on this input than it may resolve.
> >
> >
> > (This is exactly what Myrrix does, and this scale wouldn't be a big
> > deal, most especially since you can just partition by user and scale
> > up to as many servers as you want. I would encourage you look at
> > http://myrrix.com as it sounds like it matches up strongly with your
> > situation.)
> >
> > Sean
> >
> > On Fri, Oct 5, 2012 at 10:44 AM, Johannes Schulte
> > wrote:
> > > Hi!
> > >
> > > I got a question concerning a recommendation / classification problem
> > which
> > > i originally wanted to solve with matrix factorization methods from
> > taste /
> > > mahout.
> > >
> > > It has the following properties.
> > >
> > > - There are about ~200k items
> > > - There are a lot more users (say, millions) and they are very volatile
> > > (like sessions)
> > > - There is no need for the user factor matrix since the recommendation
> is
> > > very "near-time" dependent. At the time of deployment the user factors
> > need
> > > to be constructed from the items they interacted with in the last
> > seconds,
> > > hence relying on an hourly deployment cycle is not suitable.
> > > - The double user factor arrays for the matrix factorization technique
> > > become very large
> > >
> > > The question now is:
> > >
> > > Given that im only interested in item latent features, how differs that
> > > from a soft k-means clustering over items (with coocurrence vectors?)
> > > I think the recommendation then could also be expressed as a linear
> > > combination of distances to clusters. Some papers suggest that nmf and
> > > k-means use basically the same loss function, so i hope it's not a
> > totally
> > > stupid idea.
> > >
> > > The cluster membership vectors (or latent features) should be used
> later
> > on
> > > as input to a regression model, that's why a neighbourhood approach
> > doesn't
> > > fit
> > >
> > > The main benefit for me would be
> > >
> > > 1. Simplicity
> > > 2. Performance ( I don't need a running cluster for K-Means it works
> > pretty
> > > well on one machine, as opposed to mf)
> > > 3. Maybe more freedom to include side information into the clustering
> > > without implementing a new mf technique in mahout
> > > 4. Incremental updates of clusters to model variations over time, maybe
> > > someday with the streaming k-means thing
> > >
> > > Thanks for time, i'd appreciate any opinions
> > >
> > > Johannes
> >
>