spark-user mailing list archives

From "Y. Sakamoto" <ks...@muc.biglobe.ne.jp>
Subject Re: Can't cache RDD of collaborative filtering on MLlib
Date Wed, 11 Mar 2015 16:01:17 GMT
Hello.

I called `count()` on both RDDs. After that, `userJavaRDD` and
`productJavaRDD` were cached, and `predict` became faster.

Thank you.
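
For readers following the thread, the lazy behaviour described in the quoted
reply can be illustrated without Spark. The sketch below is a plain-Java
analogy, not Spark's implementation: `LazyCached`, `get()`, and `loadCount()`
are hypothetical names invented for this example. Wrapping the loader plays
the role of `cache()` (nothing is loaded yet), and the first `get()` plays the
role of an action such as `count()` that materializes the data. Later calls
reuse the stored value.

```java
import java.util.function.Supplier;

public class LazyCacheSketch {

    /** Hypothetical stand-in for a cached RDD: loads on first access, then reuses. */
    static final class LazyCached<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean materialized = false;
        private int loadCount = 0;

        LazyCached(Supplier<T> loader) {
            // Like cache(): only records how to load; nothing is read here.
            this.loader = loader;
        }

        synchronized T get() {
            if (!materialized) {
                // First access pays the full load cost (the "slow first predict").
                value = loader.get();
                loadCount++;
                materialized = true;
            }
            return value;
        }

        synchronized int loadCount() {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        // The lambda stands in for an expensive read from HDFS.
        LazyCached<double[]> features =
            new LazyCached<>(() -> new double[] {0.1, 0.2, 0.3});

        System.out.println("loads after wrapping: " + features.loadCount());
        features.get(); // analogous to calling count() to force materialization
        features.get(); // served from memory, no second load
        System.out.println("loads after two reads: " + features.loadCount());
    }
}
```

In the Spark code from this thread, the corresponding step is to call
`userJavaRDD.count()` and `productJavaRDD.count()` after `cache()` and before
constructing the model, as suggested in the reply quoted below.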


On 2015/03/10 4:05, Xiangrui Meng wrote:
> cache() is lazy. The data is stored into memory after the first time
> it gets materialized. So the first time you call `predict` after you
> load the model back from HDFS, it still takes time to load the actual
> data. The second time will be much faster. Or you can call
> `userJavaRDD.count()` and `productJavaRDD.count()` explicitly to load
> both into memory before you create the model. -Xiangrui
>
> On Sun, Mar 8, 2015 at 9:43 AM, Yuichiro Sakamoto
> <ksooj@muc.biglobe.ne.jp> wrote:
>> Hello.
>>
>> I am writing a collaborative filtering program with Spark,
>> but I am having trouble with its computation speed.
>>
>> I want to implement a recommendation program using ALS (MLlib)
>> that runs as a separate process from Spark.
>> But access to the MatrixFactorizationModel object on HDFS is slow,
>> so I want to cache it, but I can't.
>>
>> There are 2 processes:
>>
>> process A:
>>
>>    1. Create MatrixFactorizationModel by ALS
>>
>>    2. Save following objects to HDFS
>>      - MatrixFactorizationModel (on RDD)
>>      - MatrixFactorizationModel#userFeatures(RDD)
>>      - MatrixFactorizationModel#productFeatures(RDD)
>>
>> process B:
>>
>>    1. Load model information saved by process A.
>>       # In process B, Master of SparkContext is set to "local"
>>      ==========
>>      // Read Model
>>      JavaRDD<MatrixFactorizationModel> modelRDD =
>>          sparkContext.objectFile("<HDFS path>");
>>      MatrixFactorizationModel preModel = modelRDD.first();
>>      // Read Model's RDD
>>      JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
>>          sparkContext.objectFile("<HDFS path>");
>>      JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
>>          sparkContext.objectFile("<HDFS path>");
>>      // Create Model
>>      MatrixFactorizationModel model = new MatrixFactorizationModel(
>>          preModel.rank(),
>>          JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
>>      ==========
>>
>>    2. Call the "predict" method of the above MatrixFactorizationModel object.
>>
>>
>> Step 2 of process B is slow because the objects are read from
>> HDFS every time.
>> # I confirmed that the result of the recommendation is correct.
>>
>> So I tried to cache "productJavaRDD" and "userJavaRDD" as follows,
>> but the "predict" method did not respond.
>> ==========
>> // Read Model
>> JavaRDD<MatrixFactorizationModel> modelRDD =
>>     sparkContext.objectFile("<HDFS path>");
>> MatrixFactorizationModel preModel = modelRDD.first();
>> // Read Model's RDD
>> JavaRDD<Tuple2<Object, double[]>> productJavaRDD =
>>     sparkContext.objectFile("<HDFS path>");
>> JavaRDD<Tuple2<Object, double[]>> userJavaRDD =
>>     sparkContext.objectFile("<HDFS path>");
>> // Cache
>> productJavaRDD.cache();
>> userJavaRDD.cache();
>> // Create Model
>> MatrixFactorizationModel model = new MatrixFactorizationModel(
>>     preModel.rank(),
>>     JavaRDD.toRDD(userJavaRDD), JavaRDD.toRDD(productJavaRDD));
>> ==========
>>
>> I could not understand why the "predict" method froze.
>> Could you please tell me how to cache the objects?
>>
>> Thank you.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-cache-RDD-of-collaborative-filtering-on-MLlib-tp21962.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
>


-- 
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*
     Yuichiro SAKAMOTO
         - ksooj@muc.biglobe.ne.jp
         - phonypianist@gmail.com
         - http://www2u.biglobe.ne.jp/~yuichi/
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*



