spark-dev mailing list archives

From Chunnan Yao <yaochun...@gmail.com>
Subject Support parallelized online matrix factorization for Collaborative Filtering
Date Mon, 06 Apr 2015 06:48:33 GMT
Online collaborative filtering (CF) has been widely used and studied.
Re-training a CF model from scratch every time new data comes in is very
inefficient
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
However, in the Spark community we see little discussion of collaborative
filtering on streaming data. Given streaming k-means, streaming logistic
regression, and the ongoing work on incremental model training for the Naive
Bayes classifier (SPARK-4144), we think it is worthwhile to consider streaming
collaborative filtering support in MLlib.

I've created a JIRA issue (SPARK-6711) for discussion. We
suggest following the approach in this paper
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on
SGD instead of ALS, which makes it easier to apply to streaming data.
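
To illustrate why SGD suits streaming data, here is a minimal single-process
sketch of online matrix factorization. It is not the paper's parallel
algorithm and not an MLlib API: the class name OnlineMF, the method names, and
the default learning rate and regularization values are all assumptions chosen
for illustration. The point is only that each new rating touches one user
vector and one item vector, so the model can be updated per mini-batch instead
of being re-trained from scratch.

```python
import random


def init_factors(rank, rng, scale=0.1):
    """Small random latent vector for a user or item seen for the first time."""
    return [rng.uniform(-scale, scale) for _ in range(rank)]


def sgd_update(user_vec, item_vec, rating, lr=0.05, reg=0.02):
    """One SGD step on the regularized squared error of a single rating.

    Updates both vectors in place and returns the error before the update.
    """
    pred = sum(u * v for u, v in zip(user_vec, item_vec))
    err = rating - pred
    for k in range(len(user_vec)):
        u, v = user_vec[k], item_vec[k]
        user_vec[k] = u + lr * (err * v - reg * u)
        item_vec[k] = v + lr * (err * u - reg * v)
    return err


class OnlineMF:
    """Hypothetical incremental matrix factorization model (illustrative only)."""

    def __init__(self, rank=8, lr=0.05, reg=0.02, seed=0):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.users = {}  # user id -> latent vector
        self.items = {}  # item id -> latent vector

    def partial_fit(self, ratings):
        """Consume one mini-batch of (user, item, rating) triples."""
        for user, item, rating in ratings:
            if user not in self.users:
                self.users[user] = init_factors(self.rank, self.rng)
            if item not in self.items:
                self.items[item] = init_factors(self.rank, self.rng)
            sgd_update(self.users[user], self.items[item], rating,
                       self.lr, self.reg)

    def predict(self, user, item):
        """Dot product of the latent vectors; 0.0 for unseen users or items."""
        if user not in self.users or item not in self.items:
            return 0.0
        return sum(u * v for u, v in zip(self.users[user], self.items[item]))


def mean_squared_error(model, ratings):
    return sum((r - model.predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)
```

In a streaming setting, partial_fit would be called once per arriving
micro-batch. A distributed version, as in the paper, would additionally need a
way to partition users and items across workers, which this sketch leaves out.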

Fortunately, the authors of this paper have implemented their algorithm as a
GitHub project, based on Storm:
https://github.com/MrChrisJohnson/CollabStream

Please don't hesitate to share your opinions on this issue and our planned
approach. We'd like to work on this in the next few weeks.



-----
Feel the sparking Spark!

