spark-user mailing list archives

From Alex T <>
Subject MLLib ALS question
Date Tue, 30 Sep 2014 17:44:48 GMT
I'm trying to use matrix factorization over a dataset with about 6.5M users,
2.5M products, and 120M ratings. The test is run in standalone mode with a
single worker (quad-core, 16 GB RAM).

The program runs out of memory, and I suspect this happens because flatMap
holds data in memory.
(I also tried the MovieLens dataset, which has 65k users, 11k movies and 100M
ratings, and that test completes without any problem.)

Is there any way to make ALS hold the data on disk instead of in memory?
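For what it's worth, here is a minimal sketch of how the input ratings RDD can at least be persisted to disk before training, assuming the Spark 1.x MLlib API (the input path, CSV format, and the rank/iterations/lambda values are illustrative assumptions, not taken from your setup):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// "sc" is the SparkContext (e.g. from the Spark shell).
// Hypothetical input: lines of "userId,productId,rating".
val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Keep the input RDD on disk rather than on the JVM heap.
ratings.persist(StorageLevel.DISK_ONLY)

// rank = 20, iterations = 10, lambda = 0.01 are placeholder values.
val model = ALS.train(ratings, 20, 10, 0.01)

// Drop the cached blocks (both memory and disk) when done.
ratings.unpersist()
```

Note that this only controls the input RDD; ALS caches its own intermediate RDDs internally, which your persistence setting does not affect (later Spark releases expose setters for the intermediate RDD storage level, so it may be worth checking the API docs for your version). Also, even with DISK_ONLY, Spark still uses executor memory to serialize and deserialize blocks as they are written and read back, which may account for part of the RAM usage you observe.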

While testing with the MovieLens dataset, I noticed that after all the jobs
finished, the program still held some residual RDDs in memory. Why is that?

And one last, more general question: why, when I persist an RDD with
StorageLevel.DISK_ONLY, does the Unix system monitor show Apache Spark using
the same amount of RAM as when I persist it in memory?

Thanks in advance. I hope this is understandable, since English is not my
main language.
