spark-user mailing list archives

From Jörn Franke <>
Subject Re: Is Spark right for us?
Date Mon, 07 Mar 2016 08:05:46 GMT
I think the relational database will be faster for ordinal data (e.g. answers on a scale from 1 to x).
For free-text fields I would recommend Solr or Elasticsearch, because they have a lot more
text-analytics capabilities that do not exist in a relational database or MongoDB and are
not likely to be there in the near future.
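
Regarding the report-generation loop quoted below: at the stated upper bounds
(100,000 users x 10 fill-outs each x 100 questions) that is on the order of
100 million answer rows, which a well-indexed relational database can often
still handle, but which is also a comfortable size for Spark. If you do
evaluate Spark, the per-report loop maps naturally onto a single group-by
aggregation over the raw answers instead of a report-by-report loop. Below is
a minimal sketch, assuming the Spark 2.x DataFrame API; the column names
(surveyId, questionId, userId, rating) and the inline sample rows are
placeholders, not your actual schema, and in practice you would read the
answers with spark.read.jdbc(...) from the existing database or from an export.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, min, max}

object SurveyReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("survey-report-sketch")
      .master("local[*]")   // point this at a real cluster outside local tests
      .getOrCreate()
    import spark.implicits._

    // Placeholder answer data; in practice read from the relational store,
    // e.g. spark.read.jdbc(url, "answers", connectionProperties).
    val answers = Seq(
      ("survey-1", "q1", "user-1", 4),
      ("survey-1", "q1", "user-2", 5),
      ("survey-1", "q2", "user-1", 2)
    ).toDF("surveyId", "questionId", "userId", "rating")

    // One distributed pass computes the per-question data points
    // (average, min, max) for every survey at once, instead of
    // looping report by report.
    val dataPoints = answers
      .groupBy($"surveyId", $"questionId")
      .agg(
        avg($"rating").as("avgRating"),
        min($"rating").as("minRating"),
        max($"rating").as("maxRating"))

    dataPoints.show()
    spark.stop()
  }
}

Each report would then just select its subset of rows from dataPoints and
publish it; that last step also parallelizes trivially because the reports
are independent of each other.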

> On 06 Mar 2016, at 18:25, Guillaume Bilodeau <> wrote:
> The data is currently stored in a relational database, but a migration to a document-oriented
> database such as MongoDB is something we are definitely considering.  How does this factor in?
>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <> wrote:
>> Hi,
>> That depends on a lot of things, but as a starting point I would ask whether you
>> are planning to store your data in JSON format?
>> Regards,
>> Gourav Sengupta
>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <> wrote:
>>> Our problem space is survey analytics.  Each survey comprises a set of
>>> questions, with each question having a set of possible answers.  Survey
>>> fill-out tasks are sent to users, who have until a certain date to complete
>>> them.  Based on these survey fill-outs, reports need to be generated.  Each
>>> report deals with a subset of the survey fill-outs, and comprises a set of
>>> data points (average rating for question 1, min/max for question 2, etc.).
>>> We are dealing with rather large data sets - although reading the internet
>>> we get the impression that everyone is analyzing petabytes of data...
>>> Users: up to 100,000
>>> Surveys: up to 100,000
>>> Questions per survey: up to 100
>>> Possible answers per question: up to 10
>>> Survey fill-outs / user: up to 10
>>> Reports: up to 100,000
>>> Data points per report: up to 100
>>> Data is currently stored in a relational database but a migration to a
>>> different kind of store is possible.
>>> The naive algorithm for report generation can be summed up as this:
>>> for each report to be generated {
>>>   for each report data point to be calculated {
>>>     calculate data point
>>>     add data point to report
>>>   }
>>>   publish report
>>> }
>>> In order to deal with the upper limits of these values, we will need to
>>> distribute this algorithm to a compute / data cluster as much as possible.
>>> I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
>>> Hazelcast and several others, and am still confused as to how each of these
>>> can help us and how they fit together.
>>> Is Spark the right framework for us?
