I'm trying to pull a full table from Oracle, which is huge, some 10 million records, as the initial load into HDFS.
Then I will do daily delta loads into the same HDFS folder.
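For what it's worth, here is the kind of initial load I mean: a minimal sketch using Spark's JDBC data source, where the connection details, table name, paths, and the numeric key ID used to partition the read are all hypothetical placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OracleInitialLoad").getOrCreate()

// All connection details and names below are placeholders.
// Partitioning on a numeric key lets Spark read the 10 million rows
// in parallel instead of issuing one huge query.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "MYSCHEMA.MYTABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()

// Day-0 snapshot on HDFS (path is a placeholder).
df.write.mode("overwrite").parquet("/data/mytable/current")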
Now, my question is this:
DAY 0 - I did the initial load (full dump).
DAY 1 - I'll load only that day's data, say 10 records (5 existing rows with some columns' values altered, and 5 new rows).
How do I push this delta file to HDFS through Spark code? If I append, it will create duplicates, which I don't want. If I keep separate files, the downstream program is given the path of the folder containing all the files, so the registerTempTable over that folder will still have duplicates for those 5 old rows.
What is the BEST logic to be applied here?
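One way to handle this, since files on HDFS are immutable, is to union the current snapshot with the delta and keep only the newest version of each key, writing the result out as a fresh snapshot. A minimal sketch, assuming a primary-key column id and a last-modified column last_updated (both hypothetical, as are the paths):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("DeltaMerge").getOrCreate()

// Paths are placeholders; adjust to your layout.
val base  = spark.read.parquet("/data/mytable/current")
val delta = spark.read.parquet("/staging/mytable/day1")

// After the union, keep only the most recent version of each key.
val merged = base.union(delta)
  .withColumn("rn", row_number().over(
    Window.partitionBy("id").orderBy(col("last_updated").desc)))
  .filter(col("rn") === 1)
  .drop("rn")

// Write a new snapshot and point readers (or a directory rename) at it;
// appending in place is exactly what creates the duplicates.
merged.write.mode("overwrite").parquet("/data/mytable/new")

On Spark 1.x, where registerTempTable lives, the same idea works through HiveContext, which is what supports window functions on those versions.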
I tried to resolve this by searching that file for matching records and, where a match is found, loading the new record and deleting the old one, but this will be time-consuming for such a huge number of records, right?
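If the per-record search is the concern, the same match-and-replace can be expressed as a single distributed left-anti join, which Spark runs in parallel rather than as millions of individual lookups. A sketch under the same assumptions as above (id is the hypothetical primary key):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DeltaAntiJoin").getOrCreate()

val base  = spark.read.parquet("/data/mytable/current")
val delta = spark.read.parquet("/staging/mytable/day1")

// Drop every base row whose key reappears in the delta, then add the
// delta rows back: one distributed join instead of 10 million lookups.
val refreshed = base.join(delta, Seq("id"), "left_anti").union(delta)
refreshed.write.mode("overwrite").parquet("/data/mytable/new")

Note that the "left_anti" join type is available from Spark 2.0; on 1.x, a left outer join filtered on a null right-hand key achieves the same effect.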