It's not just the join: 20M users x 30M products means you're trying to
compute 600 trillion dot products. That will never finish quickly.
Basically: don't do this :) In general you don't compute all
recommendations for all users; you recompute for a small subset of
users that were recently active or are likely to be active soon. (Or
you compute recommendations on the fly, per request.) Is anything like
that an option?
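To make that concrete, here is a minimal sketch of the "small subset" approach against the MLlib API of that era. The names `activeUserIds` and `someUserId` are assumptions standing in for however you identify recently active users; the `???` placeholders mirror the original code and must be filled with real RDDs:

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

val ratings: RDD[Rating] = ???                  // all user->product implicit feedback
val model: MatrixFactorizationModel =
  ALS.trainImplicit(ratings, /* rank = */ 10, /* iterations = */ 10)

val activeUserIds: RDD[Int] = ???               // hypothetical: users seen in the last day
val products: RDD[Int] = ???                    // 30M productIds

// The join is now |active users| x 30M, not 20M x 30M.
val candidatePairs: RDD[(Int, Int)] = activeUserIds.cartesian(products)
val recommendations: RDD[Rating] = model.predict(candidatePairs)

// Or, on the fly for a single user (someUserId is a placeholder):
// val topTen: Array[Rating] = model.recommendProducts(someUserId, 10)
```

If the active set is, say, 1% of users, the scoring work drops by two orders of magnitude while still refreshing recommendations for the users who will actually see them.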
On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan wrote:
> Trying to build a recommendation system using Spark MLlib's ALS.
> Currently, we're trying to pre-build recommendations for all users on a daily
> basis. We're using simple implicit feedback and ALS.
> The problem is, we have 20M users and 30M products, and to call the main
> predict() method we need the cartesian join of users and products, which is
> too huge; it may take days just to generate the join. Is there a way to
> avoid the cartesian join and make the process faster?
> Currently we have 8 nodes with 64 GB of RAM, which I think should be enough
> for the data.
> val users: RDD[Int] = ???    // RDD with 20M userIds
> val products: RDD[Int] = ??? // RDD with 30M productIds
> val ratings: RDD[Rating] = ??? // RDD with all user->product feedback
> // train (run() was missing; predict() lives on the resulting model)
> val model = new ALS().setRank(10).setIterations(10).setImplicitPrefs(true).run(ratings)
> val usersProducts = users.cartesian(products) // 20M x 30M = 600 trillion pairs
> val recommendations = model.predict(usersProducts)