spark-user mailing list archives

From Alec Taylor <>
Subject Re: Spark for core business-logic? - Replacing: MongoDB?
Date Tue, 06 Jan 2015 02:14:50 GMT
Thanks all. To answer your clarification questions:

- I'm writing this in Python
- A similar problem to my actual one is to find the 30-minute slots
(over the next 12 months) that k users have in common. Total slots: r.
Total users: n. Given n=10000 and r=17472, the [naïve] time complexity
is $\mathcal{O}(nr)$, with n*r=174,720,000. I may be able to get
$\mathcal{O}(n \log r)$ if not $\log \log$ from reading the literature
on sequence matching; however, this is uncertain.
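
For illustration only, here is a minimal sketch of the naïve approach described above. It assumes each user's availability is a set of integer slot indices and that "k users have in common" means "at least k users are free in that slot". The names `common_slots` and `free` are hypothetical, and r=17472 is taken here to correspond to 364 days of 48 half-hour slots.

```python
from collections import Counter
from typing import Dict, List, Set

SLOTS_PER_DAY = 48   # 30-minute slots in one day
DAYS = 364           # assumption: 52 weeks, so 364 * 48 = 17472 = r
R = SLOTS_PER_DAY * DAYS


def common_slots(free: Dict[str, Set[int]], k: int) -> List[int]:
    """Return the slot indices in which at least k users are free.

    Naive counting: every (user, slot) pair is visited at most once,
    so the work is bounded by n * r.
    """
    counts: Counter = Counter()
    for slots in free.values():
        counts.update(slots)  # one increment per free slot of this user
    return sorted(slot for slot, c in counts.items() if c >= k)


# Toy data (hypothetical), using small slot indices for readability.
free = {
    "alice": {0, 1, 2, 10},
    "bob": {1, 2, 3},
    "carol": {2, 10, 11},
}
print(common_slots(free, 2))  # [1, 2, 10]
print(common_slots(free, 3))  # [2]
```

At n=10000 and r=17472 this is at most 174,720,000 increments per query, which is why a faster method or a distributed one may be worth investigating.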

So assuming all the other business-logic which needs to be built in,
such as authentication and various other CRUD operations, as well as
this more intensive sequence searching operation, what stack would be
best for me?

Thanks for all suggestions

On Mon, Jan 5, 2015 at 4:24 PM, Jörn Franke <> wrote:
> Hello,
> It really depends on your requirements: what kind of machine learning
> algorithm you need, your budget, whether you are building something really
> new or integrating it with an existing application, etc. You can also run
> MongoDB as a cluster. I don't think this question can be answered in
> general; it depends on the details of your case.
> Best regards
> On 4 Jan 2015 01:44, "Alec Taylor" <> wrote:
>> In the middle of doing the architecture for a new project, which has
>> various machine learning and related components, including:
>> recommender systems, search engines and sequence [common intersection]
>> matching.
>> Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
>> backed by Redis).
>> Though I don't have experience with Hadoop, I was thinking of using
>> Hadoop for the machine-learning (as this will become a Big Data
>> problem quite quickly). To push the data into Hadoop, I would use a
>> connector of some description, or push the MongoDB backups into HDFS
>> at set intervals.
>> However I was thinking that it might be better to put the whole thing
>> in Hadoop, store all persistent data in Hadoop, and maybe do all the
>> layers in Apache Spark (with caching remaining in Redis).
>> Is that a viable option? - Most of what I see discusses Spark (and
>> Hadoop in general) for analytics only. Apache Phoenix exposes a nice
>> interface for read/write over HBase, so I might use that if Spark ends
>> up being the wrong solution.
>> Thanks for all suggestions,
>> Alec Taylor
>> PS: I need this for both "Big" and "Small" data. Note that I am using
>> the Cloudera definition of "Big Data" referring to processing/storage
>> across more than 1 machine.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
