spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Re: Spark Beginner: Correct approach for use case
Date Mon, 06 Mar 2017 00:56:58 GMT
Hi Allan,

Where is the data stored right now? If it's in a relational database, and you are using Spark
with Hadoop, I feel like it would make sense to import the data into HDFS, just because
it would be faster to access there. You could use Sqoop to do that.

In terms of having a long running Spark context, you could look into the Spark job server:

https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md

It would allow you to cache all of the data in memory and then accept queries via REST API calls.
You would have to refresh the cache whenever the data changes, of course, but it sounds like that
will not happen very often.
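
For reference, a job for the job server is just a class that implements its job trait, and the
SparkContext stays alive between requests. Here is a rough sketch in Scala (the class, parameter,
and message names are made up, and the exact trait names depend on the job server version, so
check the README):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    object CustomerQueryJob extends SparkJob {

      // Called for every REST request; the SparkContext is shared and long-lived,
      // so anything cached through it stays available between requests.
      override def runJob(sc: SparkContext, config: Config): Any = {
        val customerId = config.getString("customerId")
        // ... run the query for this customer against the cached data here ...
        s"ran query for customer $customerId"
      }

      // Reject requests that are missing required parameters.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
        if (config.hasPath("customerId")) SparkJobValid
        else SparkJobInvalid("missing customerId parameter")
      }
    }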

In terms of running the queries themselves, I would think you could use Spark SQL and the
DataFrame/Dataset API, which are built into Spark. You will have to think about the best way
to partition your data, depending on the shape of those queries.

Here is a link to the Spark SQL docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html
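
To make that concrete, here is a minimal sketch of what the query side could look like once the
data is in HDFS. The Parquet path and the customer_id column are just assumptions about your
schema, and the literal 12345 stands in for whatever parameter each request passes in:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CustomerQueries").getOrCreate()
    import spark.implicits._

    // Load once, repartition so each customer's rows land in the same partition,
    // and pin the whole dataset in executor memory.
    val customers = spark.read.parquet("hdfs:///data/customer_records")
      .repartition($"customer_id")
      .cache()

    customers.createOrReplaceTempView("customers")

    // Only the parameter (here the customer id) changes from request to request.
    val result = spark.sql("SELECT * FROM customers WHERE customer_id = 12345")
    result.show()

Because the grouping is per customer and each group is small (1-5000 rows), repartitioning by
that column should keep each query touching only a small slice of the cached data.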

I hope that helps, and I'm sure other folks will have some helpful advice as well.

Thanks,
Subhash 

Sent from my iPhone

> On Mar 5, 2017, at 3:49 PM, Allan Richards <allan.richards@gmail.com> wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large dataset
(1 billion rows). I'm a bit lost with all the different libraries / add-ons to Spark, and
am looking for some direction as to what I should look at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's more complicated
than a single SQL query) that I want to run against it, but the parameters to those queries
will change all the time.
>  - There is a logical grouping of the data per customer, which will generally consist
of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So ideally
I want to keep all the records in memory, but distributed over the different nodes in the
cluster. Does this mean sharing a SparkContext between queries, or is this where HDFS comes
in, or is there something else that would be better suited?
> 
> Or is there another overall approach I should look into for executing queries in "real
time" against a dataset this size?
> 
> Thanks,
> Allan.
