spark-user mailing list archives

From "Lalwani, Jayesh" <jlalw...@amazon.com.INVALID>
Subject Re: Refreshing Data in Spark Memory (DataFrames)
Date Fri, 13 Nov 2020 18:24:53 GMT
Is this a streaming application or a batch application?

Normally, for batch applications, you want to keep data consistent. If you have a portfolio of mortgages that you are computing payments for and the interest rate changes while you are computing, you don't want to compute half the mortgages with the older interest rate and the other half with the newer rate. And if you run the same mortgages tomorrow, you don't want to get completely different results than what you got yesterday. The finance industry is kind of sensitive about things like this; you can't just change things willy-nilly.

I've worked in fintech for about 8 years, and I've never heard of changing the reference data in the middle of a computation being a requirement. I would have given people heart attacks if I had told them that the reference data was changing halfway through. I'm sure there are scenarios where this is required, but I have a hard time believing it is a common one. Maybe things in finance have changed in 2020.
Normally, any reference data has an "as of date" associated with it, and every record being processed has a timestamp associated with it. You match up your input with the reference data by matching the as-of date with the timestamp. When the reference data changes, you don't remove the old records from the reference data; you add records with the new "as of date". Essentially, you keep the history of the reference data, so if you have to rerun an old computation, your results don't change.
There might be scenarios where you want to correct old reference data. In this case you update
your reference table, and rerun your computation.
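
A rough sketch of that as-of-date matching in Spark (the table names, column names, and values here are made up for illustration, not from any real system):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AsOfJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AsOfJoinSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical reference table: each rate row carries the date it became
    // effective ("as_of_date") and the date it stopped being effective ("end_date").
    // Old rows are never removed; new rows are appended with a new as-of date.
    val rates = Seq(
      ("USD", "2020-01-01", "2020-06-30", 0.035),
      ("USD", "2020-07-01", "9999-12-31", 0.030)
    ).toDF("currency", "as_of_date", "end_date", "rate")

    // Hypothetical input: each mortgage record carries the date it applies to.
    val mortgages = Seq(
      ("m-1", "USD", "2020-03-15", 250000.0),
      ("m-2", "USD", "2020-08-01", 400000.0)
    ).toDF("mortgage_id", "currency", "effective_date", "principal")

    // Match each record with the reference row that was in effect on its date,
    // so rerunning old dates keeps producing the same results.
    val joined = mortgages.join(
      rates,
      mortgages("currency") === rates("currency") &&
        col("effective_date") >= col("as_of_date") &&
        col("effective_date") < col("end_date")
    )

    joined.select("mortgage_id", "effective_date", "rate").show()
  }
}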

Now, if you are talking about streaming applications, then it's a different story. There you do want to refresh your reference data. Spark reloads DataFrames from batch sources at the beginning of every microbatch, so as long as you are reading the reference data from a non-streaming source, it will get refreshed in every microbatch. Alternatively, you can send updates to the reference data through a stream, and then merge your historic reference data with the updates coming from the streaming source.
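
A rough sketch of the first option, a stream-static join where the reference side comes from a non-streaming JDBC source (connection details, topic, and paths are made up; the Kafka source needs the spark-sql-kafka package on the classpath):

import org.apache.spark.sql.SparkSession

object StreamStaticJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamStaticJoinSketch").getOrCreate()

    // Static (non-streaming) reference data read over JDBC. As described above, the
    // batch side of a stream-static join is re-read for each microbatch (as long as
    // it is not cached), so updated rates in the database get picked up as the
    // stream runs.
    val rates = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/refdata") // hypothetical
      .option("dbtable", "rates")                              // hypothetical
      .option("user", "spark")
      .option("password", sys.env.getOrElse("REFDATA_PASSWORD", ""))
      .load()

    // Streaming input, e.g. trade events arriving on Kafka (hypothetical topic).
    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "trades")
      .load()
      .selectExpr("CAST(value AS STRING) AS currency") // simplified parsing

    // Stream-static join: each microbatch of trades is joined against the
    // reference data as it exists when that microbatch runs.
    val enriched = trades.join(rates, Seq("currency"))

    enriched.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/stream-static") // hypothetical
      .start()
      .awaitTermination()
  }
}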

From: Arti Pande <pande.arti@gmail.com>
Date: Friday, November 13, 2020 at 1:04 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)


Hi

In the financial systems world, some data is updated very frequently but is also used as reference data by a Spark job that runs for 6-7 hours. Most likely the Spark job reads that data at the beginning, keeps it in memory as a DataFrame, and keeps running for the remaining 6-7 hours. Meanwhile, if the reference data is updated by some other system, the Spark job's in-memory copy of that data (the DataFrame) goes out of sync.
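
To illustrate, the job looks roughly like this (paths and column names are made up):

import org.apache.spark.sql.SparkSession

object StaleReferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StaleReferenceSketch").getOrCreate()

    // Reference data is read once at the start of the job and cached; from here on
    // the job works against this in-memory snapshot for the rest of the 6-7 hour run.
    val refRates = spark.read.parquet("s3://bucket/reference/rates").cache() // hypothetical path
    refRates.count() // materialize the cache

    val trades = spark.read.parquet("s3://bucket/input/trades")              // hypothetical path

    // Any update another system makes to the reference data after this point is not
    // visible to the cached refRates DataFrame.
    val priced = trades.join(refRates, Seq("currency"))
    priced.write.mode("overwrite").parquet("s3://bucket/output/priced")      // hypothetical path
  }
}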

Is there a way to refresh that reference data in Spark memory / dataframe by some means?

This seems to be a very common scenario. Is there a solution / workaround for this?

Thanks & regards,
Arti Pande