spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Quick but probably silly question...
Date Tue, 17 Jan 2017 16:57:46 GMT
You run compaction, i.e. you save the modified/deleted records in a dedicated delta file. Every now and
then you compare the original and the delta file and create a new version. When querying before
compaction, you need to check both the original and the delta file. I don't think ORC needs Tez
for this, but it probably improves performance.
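The base-plus-delta pattern described above can be sketched in plain Python. This is only an illustration of the idea, not Spark, ORC, or Hive ACID code; all names (read_with_deltas, compact, the "op"/"id" record layout) are hypothetical:

```python
# Sketch of the base/delta compaction pattern: mutations go into a
# separate delta file, and reads merge the delta on top of the base.

def read_with_deltas(base_rows, delta_ops):
    """Apply delta operations (upserts/deletes keyed by id) over base rows."""
    merged = {row["id"]: row for row in base_rows}
    for op in delta_ops:
        if op["op"] == "delete":
            merged.pop(op["id"], None)      # tombstone: drop the base row
        else:                               # upsert: replace or add the row
            merged[op["id"]] = {"id": op["id"], **op["data"]}
    return list(merged.values())

def compact(base_rows, delta_ops):
    """Fold the delta into a new base version; the delta can then be discarded."""
    return read_with_deltas(base_rows, delta_ops)

base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
deltas = [{"op": "delete", "id": 1},
          {"op": "upsert", "id": 3, "data": {"name": "c"}}]

print(sorted(r["id"] for r in read_with_deltas(base, deltas)))  # [2, 3]
```

The point is that the base file is never rewritten on each mutation; only compaction produces a new immutable version, which is why the pattern fits immutable formats like Parquet and ORC.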

> On 17 Jan 2017, at 17:21, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> 
> Hi, 
> While the Parquet file is immutable and the data sets are immutable, how does Spark SQL
handle updates or deletes? 
> I mean, if I read a file using SQL into an RDD, mutate it, e.g. delete a row, and then
persist it, I now have two files. If I re-read the table back in … will I see duplicates
or not? 
> 
> The larger issue is how to handle mutable data in a multi-user / multi-tenant situation
and using Parquet as the storage. 
> 
> Would this be the right tool? 
> 
> W.R.T ORC files, mutation is handled by Tez. 
> 
> Thanks in Advance, 
> 
> -Mike
> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

