spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cheng Lian <>
Subject Re: Support for SQL on unions of tables (merge tables?)
Date Tue, 20 Jan 2015 21:15:40 GMT
I think you can resort to a Hive table partitioned by date

On 1/11/15 9:51 PM, Paul Wais wrote:
> Dear List,
> What are common approaches for addressing over a union of tables / 
> RDDs?  E.g. suppose I have a collection of log files in HDFS, one log 
> file per day, and I want to compute the sum of some field over a date 
> range in SQL.  Using log schema, I can read each as a distinct 
> SchemaRDD, but I want to union them all and query against one 'table'.
> If this data were in MySQL, I could have a table for each day of data 
> and use a MyISAM merge table to union these tables together and just 
> query against the merge table.  What's nice here is that MySQL 
> persists the merge table, and the merge table is r/w, so one can just 
> update the merge table once per day.  (What's not nice is that merge 
> tables scale poorly, backup admin is a pain, and oh hey I'd like to 
> use Spark not MySQL).
> One naive and untested idea (that achieves implicit persistence): scan 
> an HDFS directory for log files, create one RDD per file, union() the 
> RDDs, then create a Schema RDD from that union().
> A few specific questions:
>  * Any good approaches to a merge / union table? (Other than the naive 
> idea above).  Preferably with some way to persist that table / RDD 
> between Spark runs.  (How does Impala approach this problem?)
>  * Has anybody tried joining against such a union of tables / RDDs on 
> a very large amount of data?  When I've tried (non-spark-sql) 
> union()ing Sequence Files, and then join()ing them against another 
> RDD, Spark seems to try to compute the full union before doing any 
> join() computation (and eventually OOMs the cluster because the union 
> of Sequence Files is so big). I haven't tried something similar with 
> Spark SQL.
>  * Are there any plans related to this in the Spark roadmap?  (This 
> feature would be a nice compliment to, say, persistent RDD indices for 
> interactive querying).
>  * Related question: are there plans to use Parquet Index Pages to 
> make Spark SQL faster?  E.g. log indices over date ranges would be 
> relevant here.
> All the best,
> -Paul

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message