It really depends on your requirements: what kind of machine learning algorithms you need, your budget, whether you are building something really new or integrating with an existing application, etc. You can also run MongoDB as a cluster. I don't think this question can be answered in general; it depends on the details of your case.

Best regards

On 4 Jan 2015 01:44, "Alec Taylor" <alec.taylor6@gmail.com> wrote:
In the middle of doing the architecture for a new project, which has
various machine learning and related components, including:
recommender systems, search engines and sequence [common intersection]

Usually I use MongoDB (as db), Redis (as cache) and Celery (as queue,
backed by Redis).

Though I don't have experience with Hadoop, I was thinking of using
Hadoop for the machine-learning (as this will become a Big Data
problem quite quickly). To push the data into Hadoop, I would use a
connector of some description, or push the MongoDB backups into HDFS
at set intervals.

However I was thinking that it might be better to put the whole thing
in Hadoop, store all persistent data in Hadoop, and maybe do all the
layers in Apache Spark (with caching remaining in Redis).

Is that a viable option? Most of what I see discusses Spark (and
Hadoop in general) for analytics only. Apache Phoenix exposes a nice
interface for read/write over HBase, so I might use that if Spark ends
up being the wrong solution.

Thanks for all suggestions,

Alec Taylor

PS: I need this for both "Big" and "Small" data. Note that I am using
the Cloudera definition of "Big Data" referring to processing/storage
across more than 1 machine.
