spark-user mailing list archives

From Chetan Khatri <chetan.opensou...@gmail.com>
Subject Re: Update / Delete records in Parquet
Date Fri, 03 May 2019 10:33:15 GMT
Agreed on delta.io. I am exploring both options.

On Wed, May 1, 2019 at 2:50 PM Vitaliy Pisarev <vitaliy.pisarev@biocatch.com>
wrote:

> Ankit, you should take a look at delta.io, which was recently open sourced
> by Databricks.
>
> Full DML support is on the way.
>
>
>
> *From: *"Khare, Ankit" <ankit.khare@eon.com>
> *Date: *Tuesday, 23 April 2019 at 11:35
> *To: *Chetan Khatri <chetan.opensource@gmail.com>, Jason Nerothin <
> jasonnerothin@gmail.com>
> *Cc: *user <user@spark.apache.org>
> *Subject: *Re: Update / Delete records in Parquet
>
>
>
> Hi Chetan,
>
>
>
> I also agree that Parquet would not be the best option for this use case.
> I had a similar use case:
>
> 50 different tables to be downloaded from MSSQL.
>
> Source: MSSQL
>
> Destination: Apache Kudu (since it supports change data capture use cases
> very well)
>
> We used the StreamSets CDC module to connect to MSSQL and then load the CDC
> data into Apache Kudu.
>
> Total records: 3 B
>
>
>
> Thanks
>
> Ankit
>
>
>
>
>
> *From: *Chetan Khatri <chetan.opensource@gmail.com>
> *Date: *Tuesday, 23. April 2019 at 05:58
> *To: *Jason Nerothin <jasonnerothin@gmail.com>
> *Cc: *user <user@spark.apache.org>
> *Subject: *Re: Update / Delete records in Parquet
>
>
>
> Hello Jason, thank you for the reply. My use case is this: the first time,
> I do a full load with transformations/aggregations/joins and write to
> Parquet (as staging). From then on, my source is MSSQL Server, and I want to
> pull only the records that were changed / updated and apply those updates to
> the Parquet data as well, if possible without side effects.
>
>
> https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/work-with-change-tracking-sql-server?view=sql-server-2017
>
>
>
> On Tue, Apr 23, 2019 at 3:02 AM Jason Nerothin <jasonnerothin@gmail.com>
> wrote:
>
> Hi Chetan,
>
>
>
> Do you have to use Parquet?
>
>
>
> It just feels like it might be the wrong sink for a high-frequency change
> scenario.
>
>
>
> What are you trying to accomplish?
>
>
>
> Thanks,
> Jason
>
>
>
> On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri <chetan.opensource@gmail.com>
> wrote:
>
> Hello All,
>
>
>
> If I am doing an incremental load / delta and would like to update / delete
> records in Parquet, I understand that Parquet is immutable and that records
> can't be deleted / updated in place; in theory only append / overwrite can
> be done. But I can see utility tools that claim to add value for that.
>
>
>
> https://github.com/Factual/parquet-rewriter
>
>
>
> Please shed some light on this.
>
>
>
> Thanks
>
>
>
>
> --
>
> Thanks,
>
> Jason
>
>
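
The constraint raised at the start of the thread (Parquet files are immutable, so only append / overwrite is possible) means that an update or delete has to be expressed as "build a new snapshot and overwrite the old one". A minimal sketch of that idea in plain Python follows. It is an illustration only, not how delta.io, Kudu, or parquet-rewriter work internally. The function name `apply_changes`, the `(operation, key, row)` change format, and the `id` key column are all hypothetical; in a real job the snapshot would be read from and written back to Parquet, and the changes would come from the source's change tracking.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# Hypothetical change format: (operation, key, row).
# operation is "upsert" or "delete"; row is None for deletes.
Change = Tuple[str, int, Optional[Row]]


def apply_changes(snapshot: Iterable[Row],
                  changes: Iterable[Change],
                  key: str = "id") -> List[Row]:
    """Return a NEW snapshot with the changes applied.

    The input snapshot is never modified, mirroring the fact that
    existing Parquet files cannot be edited in place.
    """
    # Index the current snapshot by key, copying each row.
    merged = {row[key]: dict(row) for row in snapshot}
    for op, k, row in changes:
        if op == "delete":
            merged.pop(k, None)      # deleting a missing key is a no-op
        elif op == "upsert":
            merged[k] = dict(row)    # insert a new row or replace an old one
        else:
            raise ValueError(f"unknown operation: {op!r}")
    # Sort by key so the output is deterministic.
    return [merged[k] for k in sorted(merged)]


# Made-up example data: the staged full load plus one batch of changes.
staged = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
changes = [
    ("upsert", 2, {"id": 2, "name": "B"}),   # update an existing record
    ("delete", 3, None),                     # delete a record
    ("upsert", 4, {"id": 4, "name": "d"}),   # insert a new record
]

new_snapshot = apply_changes(staged, changes)
print(new_snapshot)
```

The result would then replace the old staging data with an overwrite. Rewriting the whole snapshot for every batch is the cost that makes plain Parquet a poor fit for high-frequency changes, which is the concern raised in the replies above.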
