spark-user mailing list archives

From "Khare, Ankit" <>
Subject Re: Update / Delete records in Parquet
Date Tue, 23 Apr 2019 08:35:02 GMT
Hi Chetan,

I also agree that for this use case Parquet would not be the best option. I had a similar use case:

50 different tables to be downloaded from MSSQL.

Source: MSSQL
Destination: Apache Kudu (since it supports change-data-capture use cases very well)

We used the StreamSets CDC module to connect to MSSQL and push the CDC data into Apache Kudu.

Total records: 3 B


From: Chetan Khatri <>
Date: Tuesday, 23. April 2019 at 05:58
To: Jason Nerothin <>
Cc: user <>
Subject: Re: Update / Delete records in Parquet

Hello Jason, thank you for the reply. My use case is this: the first time, I do a full load with transformations/aggregations/joins
and write to Parquet (as staging). From then on my source is MSSQL Server, and I want
to pull only the records that were changed/updated, and if possible apply those updates
to the Parquet data as well, without side effects.
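The incremental pattern described above (pull only rows changed since the last load, upsert them by key into the staging copy, then rewrite it) can be sketched as follows. This is a hedged, dependency-free illustration: the in-memory lists stand in for what would be Spark DataFrames read via JDBC and `spark.read.parquet`, and the column names (`id`, `value`, `modified`) and the `incremental_merge` helper are hypothetical, not from the thread.

```python
from datetime import datetime

# Hypothetical stand-ins: in Spark these would be DataFrames, with the
# watermark filter pushed down to MSSQL inside the JDBC query.
source_rows = [
    {"id": 1, "value": "a", "modified": datetime(2019, 4, 20)},
    {"id": 2, "value": "b-updated", "modified": datetime(2019, 4, 23)},
    {"id": 3, "value": "c-new", "modified": datetime(2019, 4, 23)},
]
staging_rows = [
    {"id": 1, "value": "a", "modified": datetime(2019, 4, 20)},
    {"id": 2, "value": "b", "modified": datetime(2019, 4, 20)},
]

def incremental_merge(staging, source, watermark):
    # 1. Pull only the rows changed since the last successful load.
    changed = [r for r in source if r["modified"] > watermark]
    # 2. Upsert by primary key: a changed row replaces its staging row,
    #    and brand-new rows are appended.
    merged = {r["id"]: r for r in staging}
    for r in changed:
        merged[r["id"]] = r
    # 3. The merged result would then be written back with Spark's
    #    mode("overwrite"), since Parquet files cannot be edited in place.
    return sorted(merged.values(), key=lambda r: r["id"])

result = incremental_merge(staging_rows, source_rows, datetime(2019, 4, 21))
```

After the merge, `result` holds the unchanged row 1, the updated row 2, and the new row 3; the full set is rewritten to the staging location in one overwrite rather than edited in place.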

On Tue, Apr 23, 2019 at 3:02 AM Jason Nerothin <> wrote:
Hi Chetan,

Do you have to use Parquet?

It just feels like it might be the wrong sink for a high-frequency change scenario.

What are you trying to accomplish?


On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri <> wrote:
Hello All,

If I am doing an incremental load / delta and would like to update/delete records in Parquet:
I understand that Parquet is immutable and theoretically cannot be updated or deleted in place; only
append/overwrite can be done. But I can see utility tools which claim to add value for

Please shed some light on this.
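Given the append/overwrite-only constraint on Parquet, one common workaround is to avoid rewriting the whole dataset: partition the data (e.g. by date), group the changed rows by partition, and overwrite only the partitions that were touched. A minimal sketch, using plain Python dicts as stand-ins for partition directories; the partition keys, row shapes, and the `overwrite_changed_partitions` helper are illustrative assumptions, not from the thread (in Spark this corresponds to dynamic partition overwrite, i.e. `spark.sql.sources.partitionOverwriteMode=dynamic` with `write.mode("overwrite").partitionBy(...)`).

```python
from collections import defaultdict

# Hypothetical partitioned staging data: partition key -> rows.
partitions = {
    "2019-04-20": [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}],
    "2019-04-21": [{"id": 3, "value": "c"}],
}
# Changed rows arriving from the source, each tagged with its partition.
changed = [{"id": 2, "value": "b2", "part": "2019-04-20"}]

def overwrite_changed_partitions(partitions, changed):
    by_part = defaultdict(list)
    for row in changed:
        by_part[row["part"]].append(row)
    for part, rows in by_part.items():
        # Merge the changed rows into the partition's existing rows by id,
        # then replace that partition wholesale. Partitions with no changes
        # (here "2019-04-21") are never rewritten.
        merged = {r["id"]: {"id": r["id"], "value": r["value"]}
                  for r in partitions.get(part, [])}
        for r in rows:
            merged[r["id"]] = {"id": r["id"], "value": r["value"]}
        partitions[part] = sorted(merged.values(), key=lambda r: r["id"])
    return partitions

result = overwrite_changed_partitions(partitions, changed)
```

The cost of an "update" then scales with the size of the affected partitions rather than the whole table, which is usually the best one can do on immutable Parquet without a table format (Kudu, Delta Lake, etc.) that supports upserts natively.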

