I am exploring Spark for my reporting application.
My use case is as follows...
I have 4-5 Oracle tables which contain more than 1.5 billion rows. These tables are updated very frequently, every day. I don't have the option to change database technology, so this data is going to remain in Oracle.
To generate one report, on average 15-50 million rows have to be fetched from the Oracle tables. These rows contain some BLOB columns. Most of the time is spent fetching that many rows from the database over the network; the data processing itself is not complex. Currently these reports take around 3-8 hours to complete, and I am trying to speed up the report generation process.
Can I use Spark as a caching layer in this case, to avoid fetching the data from Oracle over the network every time? I am thinking of submitting a Spark job for each report request, using Spark SQL to fetch the data, then processing it and writing the result to a file. I am trying to get a kind of data locality this way.
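To make the question concrete, this is roughly what I have in mind: a parallel JDBC read from Oracle, persisted once to columnar storage that lives with the cluster, so that report jobs read the local copy instead of going over the network. This is only a sketch; the connection URL, schema, table, column, and path names below are placeholders, and it assumes the Oracle JDBC driver (e.g. ojdbc8.jar) is on the Spark classpath.

```python
# Sketch only: assumes a running Spark cluster, the Oracle JDBC driver on
# the classpath, and hypothetical host/table/column names.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("report-cache")
         .getOrCreate())

# Parallel JDBC read: Spark issues numPartitions range queries on
# partitionColumn instead of one huge single-threaded fetch.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")  # assumed
          .option("dbtable", "REPORTS.ORDERS")                       # assumed
          .option("user", "report_user")
          .option("password", "secret")
          .option("partitionColumn", "ORDER_ID")  # numeric, indexed column
          .option("lowerBound", "1")
          .option("upperBound", "1500000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")           # larger fetch batches from Oracle
          .load())

# Persist once to columnar storage local to the cluster; subsequent
# report jobs read Parquet instead of going back to Oracle.
orders.write.mode("overwrite").parquet("hdfs:///cache/orders")
```

A later report job would then just do `spark.read.parquet("hdfs:///cache/orders")` and run its Spark SQL against that.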
Whenever data is updated in the Oracle tables, can I refresh the data in Spark storage? I can get an update feed using a messaging technology.
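For the refresh, the pattern I am imagining is an upsert: drain a batch of changed rows from the update feed (e.g. a Kafka topic), drop the stale versions from the cached copy, and append the new ones. Again a sketch under assumptions: the key column, paths, and the idea that the feed batch has been landed as Parquet are all placeholders.

```python
# Sketch only: applies a batch of updated rows (already landed from the
# update feed) to the cached copy. Table/column/path names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-refresh").getOrCreate()

cached = spark.read.parquet("hdfs:///cache/orders")
updates = spark.read.parquet("hdfs:///staging/order_updates")  # feed batch

# Upsert: remove cached rows whose keys appear in the update batch
# (left_anti join keeps only unmatched rows), then append the new versions.
refreshed = (cached.join(updates.select("ORDER_ID"), "ORDER_ID", "left_anti")
             .unionByName(updates))

# Write to a new path and swap, since a Parquet directory cannot safely be
# rewritten in place while report jobs are reading it. (A table format such
# as Delta Lake would make this an atomic MERGE instead.)
refreshed.write.mode("overwrite").parquet("hdfs:///cache/orders_v2")
```

Would this kind of periodic batch refresh be reasonable, or is there a better-established way to keep a Spark-side cache in sync with Oracle?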
Can someone from the community help me with this?
Suggestions are welcome.