I'm trying to pull a full table from Oracle. It is huge, with some 10 million records, and this will be the initial load to HDFS.
After that, I will do delta loads every day into the same folder in HDFS.
My scenario is:
DAY 0 - I do the initial load (full dump).
DAY 1 - I load only that day's data, which has, say, 10 records (5 old rows with some column values altered and 5 new rows).
My question is: how should I push this file to HDFS through Spark code? If I append, it will create duplicates (which I don't want). If I keep separate files and, in the other program that uses the data, give the path of the folder that contains all the files, then registerTempTable will still have duplicates for those 5 old rows.
What is the best logic to apply here?
I tried to resolve this by searching the existing file for records that match the delta, deleting the old rows and loading the new ones, but this will be time-consuming for such a huge data set, right?
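
To make the result I am after concrete, here is a minimal plain-Python sketch (not Spark code) of the merge semantics. It assumes every row has a unique key column, which I call `id` here; the `name` column, the sample rows and the function name `merge_delta` are made up purely for illustration.

```python
def merge_delta(base_rows, delta_rows, key="id"):
    """Apply a delta to a base data set.

    Rows in delta_rows replace base rows with the same key;
    rows with a key not seen before are added.
    The key column name ("id") is an assumption for this sketch.
    """
    # Index the base rows by key so each key appears exactly once.
    merged = {row[key]: row for row in base_rows}
    # A delta row overwrites the old version or adds a new key.
    for row in delta_rows:
        merged[row[key]] = row
    return list(merged.values())


# DAY 0: initial load (hypothetical sample rows)
base = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
# DAY 1: one altered old row (id 2) and one new row (id 3)
delta = [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]

result = merge_delta(base, delta)
```

After the merge there should be exactly one row per `id`, with the altered row taking the DAY 1 values. What I am looking for is a way to get this same outcome in Spark on HDFS without rescanning and rewriting all 10 million rows by hand every day.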