spark-user mailing list archives

From: Matt Narrell
Subject: Re: Spark to eliminate full-table scan latency
Date: Tue, 28 Oct 2014 17:32:18 GMT
I’ve been puzzled by this lately. I too would like to use the Thrift server to provide
JDBC-style access to datasets via SparkSQL. Is this possible? The examples show temp tables
created during the lifetime of a SparkContext. I assume I can use SparkSQL to query those
tables while the context is active, but what happens when the context is stopped? As I
understand it, I can no longer query the table via the Thrift server. Do I need Hive in this
scenario? I don’t want to rebuild the Spark distribution unless absolutely necessary.

From the examples, it looks like SparkSQL is syntactic sugar for manipulating an RDD, but if
I need external access to this data, I need a separate store outside of Spark (Mongo,
Cassandra, HDFS, etc.). Am I correct here?



> On Oct 27, 2014, at 7:43 PM, Ron Ayoub wrote:
> This does look like it provides a good way to allow other processes to access the contents
> of an RDD in a separate app. Is there any other general-purpose mechanism for serving up
> RDD data? I understand that the driver app and workers are all app-specific and run in
> separate executors, but it would be cool if there were some general way to create a server
> app based on Spark. Perhaps Spark SQL is that general way, and I'll soon find out. Thanks.
> Date: Mon, 27 Oct 2014 14:35:46 -0700
> Subject: Re: Spark to eliminate full-table scan latency
> You can access cached data in Spark through the JDBC server:
> On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub wrote:
> We have a table containing 25 features per item id, along with feature weights. A
> correlation matrix can be constructed for every feature pair based on co-occurrence. If a
> user inputs a feature, they can find the features correlated with it through a self-join
> that requires a single full table scan. This results in high latency for big data (10+
> seconds) due to the IO involved in the full table scan. My idea is that, for this feature,
> the data can be loaded into an RDD, and transformations and actions can be applied to find
> the correlated features for each query.
> I'm pretty sure Spark can do this sort of thing. Since I'm new, what I'm not sure about
> is whether Spark is appropriate as a server application. For instance, the driver
> application would have to load the RDD and then listen for requests and return results,
> perhaps using a socket? Are there any libraries to facilitate this sort of Spark server
> app? I understand how Spark can be used to grab data, run algorithms, and put results
> back, but is it appropriate as the engine of a server app, and what are the general
> patterns involved?
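
The "load once, then answer each query from memory" idea in the quoted message can be sketched without Spark. The following is only an illustration in plain Python, standing in for a cached RDD: the sample rows, the function names build_index and correlated_features, and the choice to count raw co-occurrences (ignoring the weights) are all assumptions, not anything from the original table or from Spark's API.

```python
from collections import defaultdict

# Hypothetical sample rows: (item_id, feature, weight).
ROWS = [
    (1, "red", 0.9),
    (1, "round", 0.5),
    (2, "red", 0.4),
    (2, "sweet", 0.8),
    (3, "round", 0.7),
    (3, "sweet", 0.2),
    (3, "red", 0.1),
]


def build_index(rows):
    """Scan the rows once and build two in-memory lookups."""
    items_by_feature = defaultdict(set)
    features_by_item = defaultdict(set)
    for item_id, feature, _weight in rows:  # weights are ignored in this sketch
        items_by_feature[feature].add(item_id)
        features_by_item[item_id].add(feature)
    return items_by_feature, features_by_item


def correlated_features(feature, items_by_feature, features_by_item):
    """Count how often every other feature shares an item with `feature`."""
    counts = defaultdict(int)
    # Only items that carry the queried feature are visited, not the whole table.
    for item_id in items_by_feature.get(feature, ()):
        for other in features_by_item[item_id]:
            if other != feature:
                counts[other] += 1
    # Highest co-occurrence first; ties broken by feature name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Built once at startup; each later query reuses it.
INDEX = build_index(ROWS)

print(correlated_features("red", *INDEX))  # [('round', 2), ('sweet', 2)]
```

A long-running driver would play the role of INDEX here: it holds the data in memory and serves each request from it, which is the pattern the JDBC server suggestion above addresses for cached Spark tables.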
