I want to use Spark to run queries against a reasonably large dataset (about 1 billion rows). I'm a bit lost among all the different libraries and add-ons for Spark, and I'm looking for some direction on which ones I should look at and which may be helpful.
A few relevant points:
- The dataset doesn't change over time.
- There are a small number of applications that I want to run against it (queries, I guess, though each is more complicated than a single SQL query), and the parameters to those queries will change all the time.
- The data is logically grouped per customer, and each customer's group will generally consist of 1 to 5,000 rows.
I want each query to run as fast as possible (within a second or two). So ideally I want to keep all the records in memory, distributed across the nodes in the cluster. Does this mean sharing a SparkContext between queries, or is this where HDFS comes in, or is there something else that would be better suited?
Or is there another overall approach I should look into for executing queries in "real time" against a dataset of this size?