spark-user mailing list archives

From Alex T <chiorts...@gmail.com>
Subject MLLib ALS question
Date Tue, 30 Sep 2014 17:44:48 GMT
Hi,
I'm trying to use matrix factorization on a dataset with about 6.5M users,
2.5M products and 120M product ratings. The test runs in standalone mode
with a single worker (quad-core, 16 GB RAM).

The program runs out of memory, and I think this happens because
flatMap holds its data in memory.
(I tried the MovieLens dataset, which has 65k users, 11k movies and 100M
ratings, and the test completed without any problem.)
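
For a sense of scale, here is a rough estimate of the raw size of the ratings alone. It is only a sketch: it assumes each rating is stored as two 4-byte integers plus an 8-byte double, and the 16 bytes of per-object overhead is an assumed figure, not a measurement. It ignores whatever intermediate data ALS builds on top of the input.

```python
# Back-of-the-envelope size of the raw ratings, before any Spark overhead.
# Assumption: one rating = (user: 4-byte int, product: 4-byte int, rating: 8-byte double).
BYTES_PER_RATING = 4 + 4 + 8

# Assumption (not measured): extra bytes per rating when each one is a separate object.
ASSUMED_OBJECT_OVERHEAD = 16


def ratings_gib(num_ratings: int, overhead_per_rating: int = 0) -> float:
    """Return the estimated size of num_ratings ratings in GiB."""
    return num_ratings * (BYTES_PER_RATING + overhead_per_rating) / 2**30


payload = ratings_gib(120_000_000)
with_overhead = ratings_gib(120_000_000, ASSUMED_OBJECT_OVERHEAD)
print(f"payload only:          {payload:.2f} GiB")
print(f"with assumed overhead: {with_overhead:.2f} GiB")
```

Even under these assumptions the input alone takes a noticeable share of 16 GB, before counting the factor matrices for 6.5M users and 2.5M products.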

Is there any way to make ALS hold the data on disk, instead of memory?

When I was trying the MovieLens dataset, I noticed that after all the jobs
finished, the program still held some residual RDDs in memory. Why is that?

One last, more general question: when I persist an RDD with
StorageLevel.DISK_ONLY, why does the Unix system monitor show Apache Spark
using the same amount of RAM as when I persist it in memory?

Thanks in advance. I hope this is understandable; English is not my first
language.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-tp15420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.