spark-dev mailing list archives

From Chetan Khatri <>
Subject Suggestion on Join Approach with Spark
Date Wed, 15 May 2019 13:58:23 GMT
Hello Spark Developers,

I have a question about a Spark join I am doing.

I have a full load of data from an RDBMS, stored on HDFS at, let's say,

val historyDF =*"/home/test/transaction-line-item"*)

and I am getting the changed data at a separate HDFS path, let's say,

val deltaDF ="/home/test/transaction-line-item-delta")

Now I would like to take all rows from deltaDF, drop only the matching
records from historyDF, and write the result to a MySQL table.

Once I am done writing to the MySQL table, I would like to overwrite
/home/test/transaction-line-item with the merged data. I can't simply
overwrite it, because of lazy evaluation and the DAG structure: the
job is still reading from that path. The only way I see is to write
somewhere else first and then write back as an overwrite.

val syncDataDF =
"sys_change_column"), Seq("TRANSACTION_BY_LINE_ID"),

val mergedDataDF = syncDataDF.union(deltaDF)
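
For reference, the merge described above can be sketched as follows.
This is only a sketch under assumptions: the join type is not visible
in the message, so a left anti join on the key is assumed here;
mergeHistoryWithDelta is a hypothetical helper name; and both
DataFrames are assumed to have the same columns (an extra
change-tracking column on the delta side would have to be dropped
first).

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not from the original message.
// Assumes historyDF and deltaDF have the same column names.
def mergeHistoryWithDelta(historyDF: DataFrame,
                          deltaDF: DataFrame,
                          key: String): DataFrame = {
  // left_anti keeps only the history rows whose key is absent from deltaDF,
  // i.e. the rows that were not changed.
  val unchangedDF = historyDF.join(deltaDF.select(key), Seq(key), "left_anti")
  // Append every delta row; unionByName matches columns by name, not position.
  unchangedDF.unionByName(deltaDF)
}
```

With the names used in this message, the call would be
mergeHistoryWithDelta(historyDF, deltaDF, "TRANSACTION_BY_LINE_ID").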

I believe this can be done with a join alone, without the union.
Please suggest the best approach.

I can't write mergedDataDF back to the path of historyDF, because that
is the path I am reading from. What I am doing now is writing to a
temporary path, then reading from there and writing back, which is a
bad idea. I need a suggestion here.

val tempMergedDF ="home/test/transaction-line-item-temp/")
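
The temp-path workaround described above can be sketched like this.
It is only a sketch under assumptions: the storage format is not shown
in the message, so Parquet is assumed; overwriteViaTemp and its
parameter names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper, not from the original message. Parquet is an assumption.
def overwriteViaTemp(spark: SparkSession,
                     mergedDF: DataFrame,
                     targetPath: String,
                     tempPath: String): Unit = {
  // Materialize the merged data first; this job still reads from targetPath.
  mergedDF.write.mode(SaveMode.Overwrite).parquet(tempPath)
  // Re-read from the temp location so the new plan no longer depends on targetPath.
  val materializedDF = spark.read.parquet(tempPath)
  // Only now is it safe to overwrite the original location.
  materializedDF.write.mode(SaveMode.Overwrite).parquet(targetPath)
}
```

This copies the data twice, which is the cost the message is asking
how to avoid.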

Please suggest the best approach.

