spark-user mailing list archives

From Rakesh Nair <raknai...@gmail.com>
Subject Re: Is there an efficient way to append new data to a registered Spark SQL Table?
Date Thu, 11 Dec 2014 19:46:34 GMT
TD,

While looking at the API reference (version 1.1.0) for SchemaRDD, I found
these two methods:
 def insertInto(tableName: String): Unit
 def insertInto(tableName: String, overwrite: Boolean): Unit

Wouldn't these be a nicer way of appending RDDs to a table, or are they not
recommended as of now? Also, would this work for a table created using the
"registerTempTable" method?


On Thu, Dec 11, 2014 at 6:46 AM, Tathagata Das <tathagata.das1565@gmail.com>
wrote:
>
> First of all, how long do you want to keep doing this? The data is
> going to grow without bound, and it's going to get too big for any
> cluster to handle. If all that is within bounds, then try the
> following.
>
> - Maintain a global variable having the current RDD storing all the
> log data. We are going to keep updating this variable.
> - Every batch interval, take new data and union it with the earlier
> unified RDD (in the global variable) and update the global variable.
> If you want SQL queries on this data, then you will have to
> re-register this new RDD as the named table.
> - With this approach the number of partitions is going to increase
> rapidly. So periodically take the unified RDD and repartition it to a
> smaller set of partitions. This messes up the ordering of data, but
> you may be fine with that if your queries are order-agnostic. Also,
> periodically, checkpoint this RDD, otherwise the lineage is going to
> grow indefinitely and everything will start getting slower.
>
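
A rough sketch of the steps above, for anyone following along. This is
only one possible reading of them, not tested code: the object name, the
"logs" table name, the repartition interval (every 20 batches) and the
target of 8 partitions are all made up for illustration. It assumes the
input is a DStream of JSON strings and that a checkpoint directory has
already been set (for example with ssc.checkpoint(...)).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper; names and constants are illustrative only.
object GrowingLogTable {
  // "Global variable" holding every JSON log line seen so far.
  @volatile private var allLogs: RDD[String] = _
  private var batches = 0L

  def attach(lines: DStream[String], sqlContext: SQLContext): Unit = {
    // foreachRDD runs this function on the driver once per batch.
    lines.foreachRDD { batch =>
      batches += 1

      // Union the new batch with the earlier unified RDD.
      var unified =
        if (allLogs == null) batch else allLogs.union(batch)

      // Periodically shrink the partition count (loses ordering).
      val compact = batches % 20 == 0
      if (compact) unified = unified.repartition(8)

      unified.cache()
      // checkpoint() must be called before the first job on the RDD;
      // it truncates the lineage once the job below has run.
      if (compact) unified.checkpoint()

      // Run a job so the cache (and checkpoint) are materialized.
      val rows = unified.count()

      // The old copy is no longer needed once the new one is cached.
      if (allLogs != null) allLogs.unpersist()
      allLogs = unified

      // Re-register under the same name; skip while there is no data,
      // since schema inference needs at least one record.
      if (rows > 0) {
        sqlContext.jsonRDD(unified).registerTempTable("logs")
      }
    }
  }
}
```

Note that jsonRDD re-infers the schema over the whole unified RDD on
every batch, which is an extra pass over the data. Supplying the schema
up front would avoid that, at the cost of a more complicated sketch.
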
> Hope this helps.
>
> TD
>
> On Mon, Dec 8, 2014 at 6:29 PM, Xuelin Cao <xuelincao@yahoo.com.invalid>
> wrote:
> >
> > Hi,
> >
> >       I'm wondering whether there is an efficient way to continuously
> > append new data to a registered Spark SQL table.
> >
> >       This is what I want:
> >       I want to build an ad-hoc query service over a JSON-formatted
> > system log. The system log is, of course, generated continuously. I
> > will use Spark Streaming to read the system log as my input, and I want
> > to find a way to efficiently append the new data to an existing Spark
> > SQL table. Furthermore, I want the whole table to be cached in
> > memory/Tachyon.
> >
> >       It looks like Spark SQL supports the "INSERT" method, but only for
> > Parquet files. In addition, it is inefficient to insert a single row at
> > a time.
> >
> >       I do know that somebody has built a system similar to what I want
> > (an ad-hoc query service over a continuously growing system log). So
> > there must be an efficient way. Does anyone know?
> >
> >
>

-- 
Regards
Rakesh Nair
