spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasad Bhalerao <>
Subject reporting use case
Date Thu, 04 Apr 2019 18:48:11 GMT

I am exploring spark for my Reporting application.
My use case is as follows...
I have 4-5 oracle tables which contains more than 1.5 billion rows. These
tables are updated very frequently every day. I don't have choice to change
database technology. So this data is going to remain in Oracle only.
To generate 1 report, on an average 15 - 50 million rows has to be fetched
from oracle tables. These rows contains some blob columns. Most of the time
is spent in fetching these many rows from db over the network. Data
processing is not that complex. Currently these report takes around 3-8
hours to complete. I trying to speed up this report generation process.

Can use spark as a caching layer in this case to avoid fetching data from
oracle over the network every time? I am thinking to submit a spark job for
each report request and use spark SQL to fetch the data and then process it
and write to a file? I trying to use kind of data locality in this case.

Whenever a data is updated in oracle tables can I refresh the data in spark
storage? I can get the update feed using messaging technology.

Can some one from community help me with this?
Suggestions are welcome.



View raw message