While looking at the API reference (version 1.1.0) for SchemaRDD, I found these two methods:
 def insertInto(tableName: String): Unit
 def insertInto(tableName: String, overwrite: Boolean): Unit

Wouldn't these be a nicer way of appending RDDs to a table, or are they not recommended as of now? Also, would this work for a table created with the "registerTempTable" method?

On Thu, Dec 11, 2014 at 6:46 AM, Tathagata Das wrote:
First of all, how long do you want to keep doing this? The data is
going to grow without bound, so eventually it will get too big for
any cluster to handle. If all that is within bounds, then try the
following.

- Maintain a global variable holding the current RDD that stores all
the log data. We are going to keep updating this variable.
- Every batch interval, take the new data, union it with the earlier
unified RDD (in the global variable), and update the global variable.
If you want SQL queries on this data, then you will have to
re-register this new RDD as the named table.
- With this approach the number of partitions is going to increase
rapidly, because each union adds the new batch's partitions. So
periodically take the unified RDD and repartition it to a smaller
set of partitions. This messes up the ordering of the data, but you
may be fine with that if your queries are order-agnostic. Also,
periodically checkpoint this RDD, otherwise the lineage is going to
grow indefinitely and everything will start getting slower.
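
The steps above could be sketched roughly as follows. This is only an illustration written against the Spark 1.1.x API, not tested code from this thread. The object name GrowingLogTable, the method names appendBatch and attach, the table name "logs", and the values 20 and 16 are all made up for the example. It assumes each input line is one JSON record and that a checkpoint directory has already been set (for example via ssc.checkpoint(...) or sc.setCheckpointDir(...)).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper, not part of Spark.
object GrowingLogTable {
  // The driver-side "global variable": every JSON log line seen so far.
  private var allLogs: Option[RDD[String]] = None
  private var batchCount = 0L

  def appendBatch(batch: RDD[String], sqlContext: SQLContext, tableName: String,
                  compactEvery: Int, numPartitions: Int): Unit = synchronized {
    batchCount += 1

    // Union the new batch with everything accumulated so far.
    val unioned = allLogs.map(_.union(batch)).getOrElse(batch)

    // Every compactEvery batches, shrink the partition count and mark the
    // RDD for checkpointing so that the lineage is cut.
    val merged =
      if (batchCount % compactEvery == 0) {
        val compacted = unioned.repartition(numPartitions)
        compacted.checkpoint() // requires a checkpoint directory to be set
        compacted
      } else {
        unioned
      }

    merged.cache()
    val rows = merged.count() // materializes the cache (and the checkpoint)

    // Release the previous generation now that the new one is cached.
    allLogs.foreach(_.unpersist())
    allLogs = Some(merged)

    // Re-register the unified data under the same table name. jsonRDD
    // re-infers the schema over all the data, which is itself a cost.
    // Skip registration while there is no data to infer a schema from.
    if (rows > 0) {
      sqlContext.jsonRDD(merged).registerTempTable(tableName)
    }
  }

  // Wire the update into a stream of JSON lines.
  def attach(lines: DStream[String], sqlContext: SQLContext): Unit =
    lines.foreachRDD { rdd => appendBatch(rdd, sqlContext, "logs", 20, 16) }
}
```

After each batch, sqlContext.sql("SELECT ... FROM logs") would see the unified data. The accumulated data still grows without bound, so this only makes sense if the total volume stays within what the cluster can cache.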

Hope this helps.


On Mon, Dec 8, 2014 at 6:29 PM, Xuelin Cao wrote:
> Hi,
>       I'm wondering whether there is an efficient way to continuously
> append new data to a registered Spark SQL table.
>       This is what I want:
>       I want to build an ad-hoc query service over a JSON-formatted system
> log. Naturally, the system log is generated continuously. I will use Spark
> Streaming to read the system log as my input, and I want to find a way to
> efficiently append the new data to an existing Spark SQL table. Furthermore,
> I want the whole table to be cached in memory/Tachyon.
>       It looks like Spark SQL supports the "INSERT" method, but only for
> Parquet files. In addition, it is inefficient to insert a single row at a
> time.
>       I do know that somebody has built a system similar to what I want (an
> ad-hoc query service over an ever-growing system log), so there must be an
> efficient way. Does anyone know?


Rakesh Nair