Hi,

I can create a view on these tables, but the thing is I am going to need almost every column from them, and I have run into issues with Oracle views over such large tables when joins are involved. Somehow Oracle would choose a poor execution plan.

Can you please tell me how creating a view will help in this scenario?

Can you please tell me if I am thinking in the right direction?

I have two challenges:
1) First, to load 2-4 TB of data into Spark very quickly.
2) Then, to keep this data updated in Spark whenever DB updates are made.
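For the first challenge, the usual lever is a partitioned JDBC read, so many executors pull ranges of rows in parallel instead of one connection streaming everything. The sketch below just illustrates how Spark's JDBC source splits a numeric column into per-partition WHERE ranges; the option names in the comment (`column`, `lowerBound`, `upperBound`, `numPartitions`) are the real `spark.read.jdbc` parameters, but all connection values are placeholders, not anything from this thread.

```python
# Sketch: roughly how Spark's JDBC source splits a numeric partition column
# into per-partition ranges (first partition open below, last open above).
def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE clause per partition covering [lower, upper)."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")                       # catches the tail
        elif i == 0:
            preds.append(f"{column} < {lo + stride}")               # catches anything below lower
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

# In Spark these ranges would come from something like (placeholders):
# df = spark.read.jdbc(url=oracle_url, table="MY_VIEW",
#                      column="ID", lowerBound=1, upperBound=1_500_000_000,
#                      numPartitions=200, properties={"fetchsize": "10000"})
```

The partition column should be indexed and roughly uniformly distributed (a sequence-generated primary key works well); otherwise some partitions do most of the work.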


On Fri, Apr 5, 2019 at 12:35 AM Jason Nerothin <jasonnerothin@gmail.com> wrote:
Hi Prasad, 

Could you create an Oracle-side view that captures only the relevant records and then use the Spark JDBC connector to load the view into Spark?
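A minimal sketch of what that load might look like: the option keys below are the standard Spark JDBC data source options, but every connection value (host, view name, credentials) is a placeholder, not something from this thread.

```python
# Sketch of reading an Oracle-side view through the Spark JDBC data source.
# All connection values are placeholders; the option names ("url", "dbtable",
# "fetchsize", ...) are the standard Spark JDBC source options.
def jdbc_options(url, view_name, user, password, fetchsize=10000):
    """Build the option map passed to spark.read.format('jdbc')."""
    return {
        "url": url,                   # e.g. jdbc:oracle:thin:@//host:1521/SERVICE
        "dbtable": view_name,         # an Oracle view is read the same way as a table
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
        "fetchsize": str(fetchsize),  # larger fetch size cuts network round trips
    }

opts = jdbc_options("jdbc:oracle:thin:@//db-host:1521/ORCL",
                    "REPORT_VIEW", "scott", "tiger")

# With a live SparkSession this would become:
# df = spark.read.format("jdbc").options(**opts).load()
```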

On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao <prasadbhalerao1983@gmail.com> wrote:

I am exploring Spark for my reporting application.
My use case is as follows...
I have 4-5 Oracle tables which contain more than 1.5 billion rows. These tables are updated very frequently, every day. I don't have the choice to change database technology, so this data is going to remain in Oracle.
To generate one report, on average 15-50 million rows have to be fetched from the Oracle tables. These rows contain some BLOB columns. Most of the time is spent fetching these rows from the DB over the network; the data processing itself is not that complex. Currently these reports take around 3-8 hours to complete, and I am trying to speed up the report generation process.

Can I use Spark as a caching layer in this case to avoid fetching data from Oracle over the network every time? I am thinking of submitting a Spark job for each report request and using Spark SQL to fetch the data, then processing it and writing it to a file. I am trying to exploit a kind of data locality here.

Whenever data is updated in the Oracle tables, can I refresh the data in Spark storage? I can get the update feed using messaging technology.
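If the update feed carries full rows keyed by primary key, the refresh is essentially an upsert into the cached copy. Spark DataFrames are immutable, so in practice this usually means merging each batch of changes into a table format that supports updates (e.g. Delta Lake's MERGE) rather than mutating a cached DataFrame in place. The plain-Python sketch below only illustrates the upsert semantics; all names are illustrative, nothing here is from this thread.

```python
# Sketch of upsert semantics for applying an update feed to a cached copy.
# In Spark this would typically be a MERGE into an updatable table format
# (e.g. Delta Lake); plain dicts stand in for that storage layer here.
def apply_updates(cache, updates, key="id"):
    """Upsert each update row into cache (a dict keyed by primary key)."""
    for row in updates:
        cache[row[key]] = row   # insert new rows, overwrite changed ones
    return cache

cache = {1: {"id": 1, "val": "a"}, 2: {"id": 2, "val": "b"}}
feed = [{"id": 2, "val": "b2"},   # changed row from the update feed
        {"id": 3, "val": "c"}]    # newly inserted row
apply_updates(cache, feed)
```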

Can someone from the community help me with this?
Suggestions are welcome.