Thanks much for your reply.

By saying on the fly, you mean caching the trained model, and querying it for each user joined with 30M products when needed?

Our question is more about the general approach, what if we have 7M DAU?
How the companies deal with that using Spark?


On Wed, Mar 18, 2015 at 3:39 PM, Sean Owen <sowen@cloudera.com> wrote:
Not just the join, but this means you're trying to compute 600
trillion dot products. It will never finish fast. Basically: don't do
this :) You don't in general compute all recommendations for all
users, but recompute for a small subset of users that were or are
likely to be active soon. (Or compute on the fly.) Is anything like
that an option?

On Wed, Mar 18, 2015 at 7:13 AM, Aram Mkrtchyan
<aram.mkrtchyan.87@gmail.com> wrote:
> Trying to build recommendation system using Spark MLLib's ALS.
>
> Currently, we're trying to pre-build recommendations for all users on daily
> basis. We're using simple implicit feedbacks and ALS.
>
> The problem is, we have 20M users and 30M products, and to call the main
> predict() method, we need to have the cartesian join for users and products,
> which is too huge, and it may take days to generate only the join. Is there
> a way to avoid cartesian join to make the process faster?
>
> Currently we have 8 nodes with 64Gb of RAM, I think it should be enough for
> the data.
>
> val users: RDD[Int] = ???           // RDD with 20M userIds
> val products: RDD[Int] = ???        // RDD with 30M productIds
> val ratings : RDD[Rating] = ???     // RDD with all user->product feedbacks
>
> val model = new ALS().setRank(10).setIterations(10)
>   .setLambda(0.0001).setImplicitPrefs(true)
>   .setAlpha(40).run(ratings)
>
> val usersProducts = users.cartesian(products)
> val recommendations = model.predict(usersProducts)