spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuichiro Sakamoto <ks...@muc.biglobe.ne.jp>
Subject Can't cache RDD of collaborative filtering on MLlib
Date Sun, 08 Mar 2015 16:43:58 GMT
Hello.

I create program, collaborative filtering using Spark,
but I have trouble with calculating speed.

I want to implement recommendation program using ALS (MLlib),
which is another process from Spark.
But access speed of MatrixFactorizationModel object on HDFS is slow,
so I want to cache it, but I can't.

There are 2 processes:

process A:

  1. Create MatrixFactorizationModel by ALS

  2. Save following objects to HDFS
    - MatrixFactorizationModel (on RDD)
    - MatrixFactorizationModel#userFeatures(RDD)
    - MatrixFactorizationModel#productFeatures(RDD)

process B:

  1. Load model information saved by process A.
     # In process B, Master of SparkContext is set to "local"
    ==========
    // Read Model
    JavaRDD<MatrixFactorizationModel> modelRDD =
sparkContext.objectFile("<HDFS path>");
    MatrixFactorizationModel preModel = modelData.first();
    // Read Model's RDD
    JavaRDD<Tuple2&lt;Object, double[]>> productJavaRDD =
sparkContext.objectFile("<HDFS path>");
    JavaRDD<Tuple2&lt;Object, double[]>> userJavaRDD =
sparkContext.objectFile("<HDFS path>");
    // Create Model
    MatrixFactorizationModel model = new
MatrixFactorizationModel(preModel.rank(),
        JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
    ==========

  2. Call "predict" method of above MatrixFactorizationModel object.


At number 2 of process B, it is slow speed because objects are read from
HDFS every time.
# I confirmed that the result of recommendation is correct.

So, I tried to cache "productJavaRDD" and "userJavaRDD" as following,
but there was no response from "predict" method.
==========
// Read Model
JavaRDD<MatrixFactorizationModel> modelRDD = sparkContext.objectFile("<HDFS
path>");
MatrixFactorizationModel preModel = modelData.first();
// Read Model's RDD
JavaRDD<Tuple2&lt;Object, double[]>> productJavaRDD =
sparkContext.objectFile("<HDFS path>");
JavaRDD<Tuple2&lt;Object, double[]>> userJavaRDD =
sparkContext.objectFile("<HDFS path>");
// Cache
productJavaRDD.cache();
userJavaRDD.cache();
// Create Model
MatrixFactorizationModel model = new
MatrixFactorizationModel(preModel.rank(),
    JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
==========

I could not understand why "predict" method was frozen.
Could you please help me how to cache object ?

Thank you.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message