spark-user mailing list archives

From Aakash Basu <>
Subject Little idea needed
Date Tue, 19 Jul 2016 19:27:57 GMT
Hi all,

I'm trying to pull a full table from Oracle. It's huge, with some 10
million records, and it will be the initial load into HDFS.

Then I will do delta loads every day into the same folder in HDFS.
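
For reference, the Day 0 pull I have in mind looks roughly like this. It's
only a sketch: the JDBC URL, credentials, table name, the ID split column
and the HDFS path are all placeholders:

import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext; every connection detail below is a placeholder
val sqlContext = new SQLContext(sc)

val fullDump = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder URL
  .option("dbtable", "MYSCHEMA.MYTABLE")                 // placeholder table
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ID") // numeric key, so the 10M-row read is split across executors
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// DAY 0: write the full dump into the HDFS folder
fullDump.write.parquet("hdfs:///data/mytable/")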

Now, my query here is,

DAY 0 - I did the initial load (full dump).

DAY 1 - I'll load only that day's data, which has, say, 10 records (5 old
rows with some columns' values altered and 5 new rows).

Here, my question is: how do I push this file to HDFS through Spark code?
If I append, it will create duplicates (which I don't want). If I keep
separate files, the other program reads the path as the folder containing
all the files, but in that case too the registerTempTable will have
duplicates for those 5 old rows.
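
To make that concrete: one way I can think of to hide the stale rows at
read time is to keep only the newest version per key before registering the
temp table. This assumes each record carries a primary key and a
last-modified column (ID and LAST_UPDATED are placeholder names), and on
Spark 1.x the window function needs a HiveContext:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// hiveContext: an org.apache.spark.sql.hive.HiveContext (window functions need one on 1.x)
val all = hiveContext.read.parquet("hdfs:///data/mytable/") // base + delta files together

// rank each key's rows newest-first, then keep only rank 1
val w = Window.partitionBy("ID").orderBy(col("LAST_UPDATED").desc)
val latest = all
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

latest.registerTempTable("mytable") // the 5 stale rows no longer show up

But that re-scans and shuffles the whole folder on every read, which brings
me back to the same cost worry.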

What is the BEST logic to be applied here?

I tried to resolve this by searching the existing data for matching records
and, where they match, loading the new ones and deleting the old. But that
will be time-consuming over such a huge record set, right?
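
Roughly, what I tried amounts to this kind of merge: drop the base rows
whose key reappears in the delta, then union the delta in and rewrite the
snapshot (ID is a placeholder key column; Spark 2.0 has a built-in
"leftanti" join type, so on 1.x I spell it as a left outer join plus a null
filter):

import org.apache.spark.sql.functions.col

val base  = sqlContext.read.parquet("hdfs:///data/mytable/base/")  // placeholder paths
val delta = sqlContext.read.parquet("hdfs:///data/mytable/delta/")

// keys touched by today's delta
val deltaKeys = delta.select(col("ID").as("delta_id"))

// keep only base rows whose key does NOT appear in the delta
val survivors = base
  .join(deltaKeys, base("ID") === deltaKeys("delta_id"), "left_outer")
  .filter(col("delta_id").isNull)
  .drop("delta_id")

// updated + brand-new rows replace the old versions
val merged = survivors.unionAll(delta)

// HDFS files can't be edited in place, so this is a full rewrite of the snapshot
merged.write.mode("overwrite").parquet("hdfs:///data/mytable/merged/")

Even so, the join shuffles all 10 million rows by the key every day, which
is exactly the cost I'm worried about.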

Please help!

