spark-user mailing list archives

From Alex <>
Subject Re: spark architecture question -- Pleas Read
Date Sun, 29 Jan 2017 10:18:20 GMT
Hi All,

Thanks for your responses. Please find the flow diagram below.

Please help me simplify this architecture using Spark.

1) Can I skip step 1 to step 4 and store the data directly in Spark?
If I store it in Spark, where does it actually get stored?
Do I need to retain Hadoop to store the data,
or can I store it directly in Spark and remove Hadoop as well?

I want to remove Informatica for preprocessing and load the file data
coming from the server directly into Hadoop/Spark.

So my question is: can I load the file data directly into Spark? If so,
where exactly will the data be stored? Do I need to have Spark installed on
top of HDFS?

2) If I retain the architecture below, can I write the output from Spark
directly back to Oracle (step 5 to step 7)?

And will writing back to Oracle from Spark perform better than using
Sqoop?

3) Can I use Spark Scala UDFs to process data from Hive and retain entire

Which among the above would be optimal?

[image: Inline image 1]

On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <>

> I strongly agree with Jorn and Russell. There are different solutions for
> data movement depending on your needs: frequency, bi-directional drivers,
> workflow, handling of duplicate records. This space is known as "Change
> Data Capture", or CDC for short. I built some products in this space that
> extensively used connection pooling over ODBC/JDBC.
> Happy to chat if you need more information.
> -Sachin Naik
> >> Hard to tell. Can you give more insight into what you are trying to
> >> achieve and what the data is about?
> >> For example, depending on your use case, Sqoop can make sense or not.
> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <>
> wrote:
> You can treat Oracle as a JDBC source (
> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
> way back out (see the same link) and write directly to Oracle. I'll leave
> the performance questions for someone else.
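
A minimal sketch of the JDBC round trip Russell describes, written in PySpark for brevity (the same data source options apply from Scala). Everything specific here is an assumption for illustration: the host, port, service name, table names, credentials, and the helper names `oracle_jdbc_options` and `oracle_round_trip` are hypothetical and do not come from this thread. It assumes an existing SparkSession and an Oracle JDBC driver on the classpath.

```python
from typing import Dict


def oracle_jdbc_options(host: str, port: int, service: str, table: str,
                        user: str, password: str) -> Dict[str, str]:
    """Build the option map for Spark's JDBC data source.

    All argument values are placeholders supplied by the caller.
    """
    return {
        # Oracle thin-driver URL using the service-name form.
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
    }


def oracle_round_trip(spark, source: Dict[str, str],
                      target: Dict[str, str], query: str) -> None:
    """Read from Oracle, transform with Spark SQL, and write back to Oracle.

    `spark` is an existing SparkSession; no Sqoop or Hive table is involved.
    """
    # Load the source table over JDBC as a DataFrame.
    df = spark.read.format("jdbc").options(**source).load()

    # Expose it to Spark SQL under a temporary name (hypothetical view name).
    df.createOrReplaceTempView("source_rows")
    result = spark.sql(query)

    # Append the result to the target Oracle table over JDBC.
    result.write.format("jdbc").options(**target).mode("append").save()
```

Usage might look like `oracle_round_trip(spark, src_opts, dst_opts, "SELECT * FROM source_rows")`, where both option maps come from `oracle_jdbc_options`. The Oracle driver jar has to be made available to Spark, for example with `--jars` on `spark-submit`. This sketch says nothing about how the performance compares with Sqoop.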
> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <>
> wrote:
>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <>
>> wrote:
>> Hi Team,
>> Right now our existing flow is
>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive
>> Context) --> destination Hive table --> Sqoop export to Oracle
>> Half of the Hive UDFs we need are developed as Java UDFs.
>> So now I want to know: if I run native Scala UDFs instead of running Hive
>> Java UDFs in Spark SQL, will there be any performance difference?
>> Can we skip the Sqoop import and export steps and
>> instead load data directly from Oracle into Spark, code Scala UDFs for the
>> transformations, and export the output data back to Oracle?
>> Right now the architecture we are using is
>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL -->
>> Hive --> Oracle
>> What would be the optimal architecture for processing data from Oracle
>> using Spark? Can I improve this process in any way?
>> Regards,
>> Sirisha
