Hi all,

I'm trying to pull a full table from Oracle. It is huge, with some 10 million records, and this will be the initial load into HDFS.

After that, I will do delta loads every day into the same folder in HDFS.

Now, my scenario is this:

DAY 0 - I did the initial load (full dump).

DAY 1 - I'll load only that day's data, which has, say, 10 records (5 existing rows with some column values changed, and 5 new rows).

My question is: how should I push this file to HDFS through Spark code? If I append, it will create duplicates (which I don't want). If I instead keep separate files, and in the other programs that use the data I give the path of the folder that contains all the files, then the table I register with registerTempTable will still have duplicates for those 5 old rows.

What is the BEST logic to apply here?

I tried to resolve this by searching the existing file for records that match the incoming ones and, where there is a match, deleting the old record and loading the new one. But this will be time-consuming for such a huge number of records, right?
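
To make the logic concrete, here is a rough sketch in plain Python (not Spark code) of what I mean by replacing matching records. It assumes every row has a unique key column, which I call "id" here. The column names, the sample rows and the merge_delta helper are all made up for illustration, and the real table is far too large to handle this way on a single machine.

```python
# Plain-Python illustration of the "replace old, add new" logic.
# Assumption: each row carries a unique key column, here named "id".
# All names and sample values below are hypothetical.

def merge_delta(base_rows, delta_rows, key="id"):
    """Apply delta_rows on top of base_rows.

    Rows whose key already exists are replaced by the delta version;
    rows with a new key are added.
    """
    merged = {row[key]: row for row in base_rows}
    for row in delta_rows:
        merged[row[key]] = row  # overwrite a changed row, or add a new one
    return list(merged.values())


# DAY 0: initial load (tiny stand-in for the 10 million rows)
day0 = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]

# DAY 1: one changed row (id 2) and one new row (id 3)
day1 = [{"id": 2, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}]

appended = day0 + day1            # plain append: id 2 appears twice
merged = merge_delta(day0, day1)  # key-based merge: one row per id

print(len(appended))  # 4
print(len(merged))    # 3
```

The append case is the duplicate problem I described above, and the merge case is the result I actually want to see when I read the folder.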

Please help!

Thanks,
Aakash.