spark-user mailing list archives

From Krishna Sankar <>
Subject Re: Spark for core business-logic? - Replacing: MongoDB?
Date Sun, 04 Jan 2015 04:40:42 GMT
   Good questions. Suggestions:

   1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer,
   Cache, Queue, App Server, App (Interface), App (backend ML) et al.
   2. Then slot in the appropriate technologies - maybe even multiple
   technologies for the same layer - and then work thru the pros and cons.
   3. Looking at the layers (moving from the easy to difficult, the mundane
   to the esoteric ;o)):
      - Cache & Queue - stick with what you are comfortable with, i.e. Redis
      et al. Also take a look at Kafka.
      - App Server - Tomcat et al
      - App (Interface) - JavaScript et al
      - DB, SQL Layer - Better off with MongoDB. You can explore
      HBase, but it is not the same.
         - The same way as MongoDB != MySQL, HBase != MongoDB
      - Machine Learning Server/Layer - Spark would fit very well here.
      - Machine Learning DFS, Data Store - HDFS
      - The idea of pushing the data to Hadoop for ML is good
         - But you need to think thru things like incremental data load,
         semantics like at least once, at most once et al.
   4. You could architect all of it with the Hadoop ecosystem. It might work,
   depending on the system.
      - But I would use caution. Most probably many of the elements would
      be better implemented in technologies suited to them.
   5. Double-click a couple more times on the design, think thru the
   functionality, scaling requirements et al
      - Draw 3 or 4 alternatives and jot down the top 5 requirements, pros
      and cons, the knowns and the unknowns
      - The optimum design will fall out of that exercise
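
To make the point in item 3 about incremental data load and at-least-once
semantics a bit more concrete, here is a minimal Python sketch. It is only an
illustration under stated assumptions, not a recommendation for a particular
connector: the names Record, Sink, doc_id, updated_at, load_batch and
changes_since are all hypothetical, and the in-memory dict merely stands in
for whatever store backs the ML layer. The idea shown is that if each load is
an upsert keyed on a stable id, a batch that is delivered twice does no harm,
and a watermark lets the next run pick up only what changed.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    doc_id: str       # hypothetical stable key from the source DB
    updated_at: int   # hypothetical increasing change timestamp
    payload: str


@dataclass
class Sink:
    """Stand-in for the ML-side data store (assumption: keyed upserts)."""
    rows: Dict[str, Record] = field(default_factory=dict)
    watermark: int = 0  # highest updated_at seen so far

    def load_batch(self, batch: Iterable[Record]) -> int:
        """Apply a batch idempotently; return how many rows changed."""
        changed = 0
        for rec in batch:
            current = self.rows.get(rec.doc_id)
            # Upsert on doc_id: a redelivered or stale record is a no-op.
            if current is None or rec.updated_at > current.updated_at:
                self.rows[rec.doc_id] = rec
                changed += 1
            self.watermark = max(self.watermark, rec.updated_at)
        return changed


def changes_since(source: List[Record], watermark: int) -> List[Record]:
    """Incremental extract: only records modified after the watermark."""
    return [r for r in source if r.updated_at > watermark]


source = [Record("a", 1, "v1"), Record("b", 2, "v1")]
sink = Sink()

first_batch = changes_since(source, sink.watermark)
sink.load_batch(first_batch)

# At-least-once delivery: the same batch arrives again and changes nothing.
replayed = sink.load_batch(first_batch)

# Later, one document is updated; only that change is extracted and loaded.
source.append(Record("a", 3, "v2"))
second = sink.load_batch(changes_since(source, sink.watermark))
```

With at-most-once delivery the trade-off flips: nothing is ever applied
twice, but a lost batch is never retried, so the watermark would have to be
rewound to recover it.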


On Sat, Jan 3, 2015 at 4:43 PM, Alec Taylor <> wrote:

> In the middle of doing the architecture for a new project, which has
> various machine learning and related components, including:
> recommender systems, search engines and sequence [common intersection]
> matching.
> Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
> backed by Redis).
> Though I don't have experience with Hadoop, I was thinking of using
> Hadoop for the machine-learning (as this will become a Big Data
> problem quite quickly). To push the data into Hadoop, I would use a
> connector of some description, or push the MongoDB backups into HDFS
> at set intervals.
> However I was thinking that it might be better to put the whole thing
> in Hadoop, store all persistent data in Hadoop, and maybe do all the
> layers in Apache Spark (with caching remaining in Redis).
> Is that a viable option? - Most of what I see discusses Spark (and
> Hadoop in general) for analytics only. Apache Phoenix exposes a nice
> interface for read/write over HBase, so I might use that if Spark ends
> up being the wrong solution.
> Thanks for all suggestions,
> Alec Taylor
> PS: I need this for both "Big" and "Small" data. Note that I am using
> the Cloudera definition of "Big Data" referring to processing/storage
> across more than 1 machine.
