spark-user mailing list archives

From Julian Ricardo <julianricardo...@gmail.com>
Subject MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
Date Fri, 16 Jan 2015 11:03:51 GMT
I am trying to use Spark MLlib ALS with implicit feedback for collaborative
filtering. The input data has only two fields, `userId` and `productId`. I have
**no product ratings**, only a record of which products each user has bought.
To train ALS I use:
 
    def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires `Rating` objects:

    Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for `trainImplicit` says: *Train a matrix
factorization model given an RDD of 'implicit preferences' ratings given by
users to some products, in the form of (userID, productID, **preference**)
pairs.*
 
When I set the rating (preference) to `1`, as in:
 
    val ratings = sc.textFile(new File(dir, file).toString).map { line =>
      val fields = line.split(",")
      // format: (randomNumber, Rating(userId, productId, rating))
      (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
    }

    val training = ratings.filter(x => x._1 < 60)
      .values
      .repartition(numPartitions)
      .cache()
    val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
      .values
      .repartition(numPartitions)
      .cache()
    val test = ratings.filter(x => x._1 >= 80).values.cache()


And then train ALS on the training split:

    val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a large error for preferences that take only
the values 0 or 1:

    val validationRmse = computeRmse(model, validation, numValidation)

    /** Compute RMSE (Root Mean Squared Error). */
    def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
      val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
      val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
        .join(data.map(x => ((x.user, x.product), x.rating)))
        .values
      math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
    }
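
To put the 0.9 figure in perspective: every row in the validation set is a purchase, so every observed value is 1.0, and the RMSE only measures how far the predicted scores fall below 1. A minimal plain-Scala sketch of the same squared-error formula, with made-up prediction values and a hypothetical object name (`RmseOnPositives`), no Spark involved:

```scala
// Sketch only: plain Scala, no Spark. The prediction values are invented
// for illustration and are not taken from the model above.
object RmseOnPositives {
  /** RMSE over (prediction, observed) pairs. */
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, r) => (p - r) * (p - r) }.sum / pairs.size)

  def main(args: Array[String]): Unit = {
    // Every held-out row is a purchase, so every observed value is 1.0.
    val predictions = Seq(0.1, 0.1, 0.1, 0.1) // hypothetical model scores
    val pairs = predictions.map(p => (p, 1.0))
    println(rmse(pairs)) // roughly 0.9
  }
}
```

Under these assumed numbers, an RMSE near 0.9 corresponds to predicted scores of about 0.1 for items the user actually bought.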

So my question is: to what value should I set `rating` in:

    Rating(user: Int, product: Int, rating: Double)

for implicit training with `ALS.trainImplicit`?
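
A note on how the `rating` field is usually interpreted in implicit mode, assuming the formulation of Hu, Koren and Volinsky that the MLlib collaborative filtering guide refers to: the value is treated as a strength of observation rather than a score to reproduce. It is split into a binary preference and a confidence weight scaled by the `alpha` parameter. The sketch below is plain Scala with hypothetical names (`ImplicitWeights`, `preference`, `confidence`); it illustrates the assumed formulas and is not MLlib's internal code. The value 40 for `alpha` is only an example.

```scala
// Sketch only: plain Scala, no Spark. Names are hypothetical, and the formulas
// assume the Hu/Koren/Volinsky implicit-feedback model, not MLlib internals.
object ImplicitWeights {
  /** Binary preference: 1.0 if any interaction was observed, else 0.0. */
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  /** Confidence in that preference: grows linearly with the observed strength r. */
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r

  def main(args: Array[String]): Unit = {
    val alpha = 40.0 // example value
    // r = 1.0 for a purchase, r = 0.0 for a (user, product) pair never seen
    for (r <- Seq(0.0, 1.0)) {
      println(s"r = $r -> preference = ${preference(r)}, confidence = ${confidence(r, alpha)}")
    }
  }
}
```

Under that assumption, a constant 1.0 marks every purchase as an equally confident positive; a purchase count could be used instead where one is available.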

**Update**

With:

    val alpha = 40
    val lambda = 0.01

I get:

    Got 1895593 ratings from 17471 users on 462685 products.
    Training: 1136079, validation: 380495, test: 379019
    RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
    RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
    RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
    RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
    The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
    baselineRmse: 0.0 testRmse: 0.7302343904091481
    The best model improves the baseline by -Infinity%.

This is still a large error, I think. I also get a strange baseline
improvement, where the baseline model simply predicts the mean (1).
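
The `-Infinity%` is plain arithmetic once every observed preference is 1.0: the mean is 1, so the baseline RMSE is exactly 0, and dividing by it gives negative infinity. A minimal plain-Scala sketch, assuming the improvement is computed as `(baselineRmse - testRmse) / baselineRmse * 100` (that formula is not shown in the post, and the object and method names are hypothetical):

```scala
// Sketch only: plain Scala, no Spark. The improvement formula is an assumption.
object BaselineCheck {
  /** RMSE of a constant prediction against the observed values. */
  def rmse(observed: Seq[Double], prediction: Double): Double =
    math.sqrt(observed.map(r => (r - prediction) * (r - prediction)).sum / observed.size)

  /** Relative improvement over the baseline, in percent (assumed formula). */
  def improvement(baselineRmse: Double, testRmse: Double): Double =
    (baselineRmse - testRmse) / baselineRmse * 100

  def main(args: Array[String]): Unit = {
    val observed = Seq.fill(5)(1.0)         // every implicit preference is 1.0
    val mean = observed.sum / observed.size // the baseline prediction
    val baselineRmse = rmse(observed, mean) // zero error: the mean matches every row
    val testRmse = 0.7302343904091481       // value reported above
    println(s"baselineRmse: $baselineRmse testRmse: $testRmse")
    println(improvement(baselineRmse, testRmse)) // negative number divided by 0.0
  }
}
```

With a constant target, any model with nonzero error looks infinitely worse than the mean, so this comparison carries no information for this data.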




