spark-user mailing list archives

From ayan guha <>
Subject Re: Spark Beginner: Correct approach for use case
Date Mon, 06 Mar 2017 02:11:52 GMT
Any specific reason to choose Spark? It sounds like you have a
write-once-read-many dataset, which is logically partitioned by
customer, sitting in some data store. Essentially you are looking for
a fast way to access it, and most likely you will use the same partition
key for querying the data. This is more of a database/NoSQL kind of use case
than a Spark one (Spark is more of a distributed processing engine, I reckon).
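
To illustrate the access pattern described above, here is a minimal sketch in
plain Python (not Spark, and not any particular NoSQL store). It assumes a
hypothetical row layout of (customer_id, order_id, amount) and a hypothetical
query, total_spend; neither comes from the original question. The point is
only that when every query filters on the same partition key, grouping the
rows by that key once means each query touches a single customer's rows
(1-5000 of them) rather than the full dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: (customer_id, order_id, amount).
# The real schema is not given in the thread; this is an assumption.
ROWS = [
    ("cust-1", 101, 25.0),
    ("cust-2", 201, 40.0),
    ("cust-1", 102, 15.5),
    ("cust-3", 301, 9.99),
]


def build_index(rows):
    """Group rows by customer_id once, at load time (the data never changes)."""
    index = defaultdict(list)
    for row in rows:
        index[row[0]].append(row)
    return dict(index)


def total_spend(index, customer_id, min_amount=0.0):
    """A parameterised 'query' that reads only one customer's rows."""
    customer_rows = index.get(customer_id, [])
    return sum(amount for _, _, amount in customer_rows if amount >= min_amount)


index = build_index(ROWS)
print(total_spend(index, "cust-1"))                   # all of cust-1's rows
print(total_spend(index, "cust-1", min_amount=20.0))  # same key, new parameter
```

A key-value or wide-column store keyed on the customer identifier gives the
same shape of lookup across a cluster, with the per-customer logic running in
the application once the rows are fetched.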

On Mon, Mar 6, 2017 at 11:56 AM, Subhash Sriram <> wrote:

> Hi Allan,
> Where is the data stored right now? If it's in a relational database, and
> you are using Spark with Hadoop, I feel like it would make sense to
> import the data into HDFS, just because it would be faster to access
> the data. You could use Sqoop to do that.
> In terms of having a long running Spark context, you could look into the
> Spark job server:
> It would allow you to cache all the data in memory and then accept queries
> via REST API calls. You would have to refresh your cache as the data
> changes of course, but it sounds like that is not very often.
> In terms of running the queries themselves, I would think you could use
> Spark SQL and the DataFrame/Dataset API, which are built into Spark. You
> will have to think about the best way to partition your data, depending on
> the queries themselves.
> Here is a link to the Spark SQL docs:
> I hope that helps, and I'm sure other folks will have some helpful advice
> as well.
> Thanks,
> Subhash
> On Mar 5, 2017, at 3:49 PM, Allan Richards <>
> wrote:
> Hi,
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add-ons to Spark, and am looking for some direction as to what
> I should look at / what may be helpful.
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
> Thanks,
> Allan.

Best Regards,
Ayan Guha
