spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Can't cache RDD of collaborative filtering on MLlib
Date Mon, 09 Mar 2015 19:05:15 GMT
cache() is lazy: the data is stored in memory only when the RDD is
first materialized. So the first `predict` call after you load the
model back from HDFS still takes time to load the actual data; the
second call will be much faster. Alternatively, call
`userJavaRDD.count()` and `productJavaRDD.count()` explicitly to load
both into memory before you create the model. -Xiangrui
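
The lazy behaviour described above can be illustrated with a small plain-Java analogy. This is only a sketch and is not Spark code: `LazyCacheSketch`, `LazyDataset`, and the `loads` counter are hypothetical names invented for the illustration, and the `Supplier` stands in for a slow read from HDFS. It shows that `cache()` by itself loads nothing, that the first action pays the load cost, and that later actions reuse the in-memory copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Plain-Java analogy of a lazily cached dataset; an illustration, not Spark's implementation. */
public class LazyCacheSketch {

    /** Hypothetical stand-in for an RDD backed by slow storage. */
    static class LazyDataset<T> {
        private final Supplier<List<T>> loader; // simulates the slow read from HDFS
        private boolean cacheRequested = false; // set by cache(), which loads nothing
        private List<T> cached = null;          // filled on first materialization
        int loads = 0;                          // how many times the loader actually ran

        LazyDataset(Supplier<List<T>> loader) {
            this.loader = loader;
        }

        /** Only marks the dataset as cacheable; no data is read here. */
        LazyDataset<T> cache() {
            cacheRequested = true;
            return this;
        }

        /** Runs the loader unless a cached copy already exists. */
        private List<T> materialize() {
            if (cached != null) {
                return cached;
            }
            loads++;
            List<T> data = loader.get();
            if (cacheRequested) {
                cached = data;
            }
            return data;
        }

        /** An "action": forces materialization, like calling count() on an RDD. */
        long count() {
            return materialize().size();
        }
    }

    public static void main(String[] args) {
        LazyDataset<double[]> features = new LazyDataset<>(() -> {
            List<double[]> rows = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                rows.add(new double[] {i, i + 1});
            }
            return rows;
        });

        features.cache();
        System.out.println("loads after cache(): " + features.loads);        // 0
        features.count();
        System.out.println("loads after first count(): " + features.loads);  // 1
        features.count();
        System.out.println("loads after second count(): " + features.loads); // 1
    }
}
```

In this analogy, calling `count()` once right after `cache()` plays the role of the explicit `userJavaRDD.count()` and `productJavaRDD.count()` calls suggested above: the load cost is paid up front instead of inside the first `predict`.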

On Sun, Mar 8, 2015 at 9:43 AM, Yuichiro Sakamoto
<ksooj@muc.biglobe.ne.jp> wrote:
> Hello.
>
> I am writing a collaborative filtering program using Spark,
> but I have trouble with its speed.
>
> I want to implement a recommendation program using ALS (MLlib),
> which runs as a separate process from Spark.
> But accessing the MatrixFactorizationModel object on HDFS is slow,
> so I want to cache it, but I can't.
>
> There are 2 processes:
>
> process A:
>
>   1. Create MatrixFactorizationModel by ALS
>
>   2. Save following objects to HDFS
>     - MatrixFactorizationModel (on RDD)
>     - MatrixFactorizationModel#userFeatures(RDD)
>     - MatrixFactorizationModel#productFeatures(RDD)
>
> process B:
>
>   1. Load model information saved by process A.
>      # In process B, Master of SparkContext is set to "local"
>     ==========
>     // Read Model
>     JavaRDD<MatrixFactorizationModel> modelRDD =
>         sparkContext.objectFile("<HDFS path>");
>     MatrixFactorizationModel preModel = modelRDD.first();
>     // Read Model's RDD
>     JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
>         sparkContext.objectFile("<HDFS path>");
>     JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
>         sparkContext.objectFile("<HDFS path>");
>     // Create Model
>     MatrixFactorizationModel model = new MatrixFactorizationModel(
>         preModel.rank(),
>         JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
>     ==========
>
>   2. Call "predict" method of above MatrixFactorizationModel object.
>
>
> Step 2 of process B is slow because the objects are read from
> HDFS every time.
> # I confirmed that the recommendation results are correct.
>
> So I tried to cache "productJavaRDD" and "userJavaRDD" as follows,
> but then the "predict" method did not respond.
> ==========
> // Read Model
> JavaRDD<MatrixFactorizationModel> modelRDD =
>     sparkContext.objectFile("<HDFS path>");
> MatrixFactorizationModel preModel = modelRDD.first();
> // Read Model's RDD
> JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
>     sparkContext.objectFile("<HDFS path>");
> JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
>     sparkContext.objectFile("<HDFS path>");
> // Cache
> productJavaRDD.cache();
> userJavaRDD.cache();
> // Create Model
> MatrixFactorizationModel model = new MatrixFactorizationModel(
>     preModel.rank(),
>     JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
> ==========
>
> I could not understand why the "predict" method froze.
> Could you please tell me how to cache these objects?
>
> Thank you.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
