spark-user mailing list archives

From Jörn Franke <>
Subject Re: Spark Beginner: Correct approach for use case
Date Mon, 06 Mar 2017 07:05:06 GMT
I agree with the others that a dedicated NoSQL datastore can make sense. You should look at
the lambda architecture paradigm. Keep in mind that more memory does not necessarily mean
more performance. What matters is having the right data structure for your users' queries. Additionally,
if your queries run over the whole dataset and you want answer times within 2
seconds, you should look at databases that do aggregations on samples of the data (cf.
E.g. Hive has had tablesample functionality for a long time.
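
To make the sampling idea concrete, here is a minimal sketch in plain Python (not Spark or Hive). It assumes a simple Bernoulli sample and a sum aggregate; the function name `estimate_sum`, the synthetic data and the 1% fraction are all made up for illustration and say nothing about how any particular database implements sampling.

```python
import random


def estimate_sum(rows, fraction, seed=0):
    """Estimate sum(rows) from a sample that keeps each row with probability `fraction`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Only the sampled rows contribute to the partial total.
    sampled_total = sum(value for value in rows if rng.random() < fraction)
    # Scale by the inverse sampling rate to estimate the full total.
    return sampled_total / fraction


# Synthetic stand-in data (hypothetical); the dataset in this thread has about 1 billion rows.
rows = [i % 100 for i in range(1_000_000)]
exact = sum(rows)
approx = estimate_sum(rows, fraction=0.01)
relative_error = abs(approx - exact) / exact
print(f"exact={exact} approx={approx:.0f} relative_error={relative_error:.4f}")
```

The trade-off is that only about 1% of the rows contribute to the aggregate, in exchange for an answer that is approximate rather than exact. Whether that error is acceptable depends on the queries.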

> On 5 Mar 2017, at 21:49, Allan Richards <> wrote:
> Hi,
> I am looking to use Spark to help execute queries against a reasonably large dataset
> (1 billion rows). I'm a bit lost with all the different libraries / add ons to Spark, and
> am looking for some direction as to what I should look at / what may be helpful.
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's more complicated
>    than a single SQL query) that I want to run against it, but the parameters to those queries
>    will change all the time.
>  - There is a logical grouping of the data per customer, which will generally consist
>    of 1-5000 rows.
> I want each query to run as fast as possible (less than a second or two). So ideally
> I want to keep all the records in memory, but distributed over the different nodes in the
> cluster. Does this mean sharing a SparkContext between queries, or is this where HDFS comes
> in, or is there something else that would be better suited?
> Or is there another overall approach I should look into for executing queries in "real
> time" against a dataset this size?
> Thanks,
> Allan.
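
As an illustration of the "right data structure" point, the per-customer grouping in the quoted message suggests a keyed layout, which is roughly what a key-value or NoSQL store provides. The sketch below is plain Python and rests on an assumption the thread does not confirm: that each query targets a single customer. The field names (`customer_id`, `amount`) and the query `total_above` are hypothetical.

```python
from collections import defaultdict


def build_customer_index(rows):
    """Group rows by customer id so a query only touches one customer's rows."""
    index = defaultdict(list)
    for customer_id, amount in rows:
        index[customer_id].append(amount)
    return index


def total_above(index, customer_id, threshold):
    """Hypothetical parameterised query: sum of one customer's amounts above `threshold`."""
    # .get avoids creating an empty entry for an unknown customer.
    return sum(a for a in index.get(customer_id, []) if a > threshold)


# Tiny made-up dataset; each tuple is (customer_id, amount).
rows = [("c1", 10), ("c2", 5), ("c1", 30), ("c3", 7), ("c1", 2)]
index = build_customer_index(rows)
print(total_above(index, "c1", 5))
```

With this layout a query reads at most the 1-5000 rows of one customer rather than scanning the whole dataset, so changing the query parameters does not change how much data is read.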
