spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: spark architecture question -- Pleas Read
Date Sun, 29 Jan 2017 13:45:17 GMT
One alternative could be Oracle Loader for Hadoop and other Oracle products, but you would
have to invest some money and probably buy their Hadoop appliance. You would have to evaluate
whether that makes sense (it can get expensive with large clusters, etc.).

Another alternative would be to get rid of Oracle altogether and use other databases.

However, can you elaborate a little on your use case and the business logic, as well as
your SLA requirements? Otherwise all recommendations are equally valid, because the
requirements you presented are very generic.

About getting rid of Hadoop: this depends! You will need some resource manager (YARN, Mesos,
Kubernetes, etc.) and most likely also a distributed file system. Spark supports a wide range
of file systems through the Hadoop APIs, but does not need HDFS for persistence. You can
use a local file system (i.e. any file system mounted on a node, so also distributed ones, such
as ZFS) or cloud file systems (S3, Azure Blob, etc.).
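To illustrate the point about file systems, here is a minimal PySpark-style sketch. It assumes an existing SparkSession is passed in as `spark`, that the relevant Hadoop connectors (for example hadoop-aws for `s3a://`) are on the classpath, and that the paths are purely hypothetical placeholders. It also shows that Spark does not store data by itself: results are kept only if you write them somewhere explicitly.

```python
# Hypothetical example locations; none of these paths come from the thread.
EXAMPLE_PATHS = {
    "local": "file:///data/incoming/events.csv",
    "hdfs": "hdfs://namenode:8020/data/incoming/events.csv",
    "s3": "s3a://example-bucket/incoming/events.csv",
    "azure_blob": "wasb://container@account.blob.core.windows.net/incoming/events.csv",
}


def scheme_of(path):
    """Return the URI scheme, which selects the file system implementation."""
    return path.split("://", 1)[0]


def load_csv(spark, path):
    """Read a CSV file with a header row from whichever file system the URI names."""
    return spark.read.option("header", "true").csv(path)


def save_parquet(df, path):
    """Persist a DataFrame explicitly; Spark does not keep the data on its own."""
    df.write.mode("overwrite").parquet(path)
```

With a running session, `load_csv(spark, EXAMPLE_PATHS["local"])` would read from a mounted file system without any HDFS involved. Note that a `file://` path has to be visible under the same name on every node that runs a task.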



> On 29 Jan 2017, at 11:18, Alex <siri8123@gmail.com> wrote:
> 
> Hi All,
> 
> Thanks for your responses. Please find the flow diagram below.
> 
> Please help me simplify this architecture using Spark.
> 
> 1) Can I skip steps 1 to 4 and store the data directly in Spark?
> If I store it in Spark, where is it actually stored?
> Do I need to retain Hadoop to store the data,
> or can I store it directly in Spark and remove Hadoop as well?
> 
> I want to remove Informatica for preprocessing and load the file data coming
> from the server directly into Hadoop/Spark.
> 
> So my question is: can I load file data directly into Spark? If so, where exactly will
> the data be stored? Do I need to have Spark installed on top of HDFS?
> 
> 2) If I retain the architecture below, can I write the output from Spark directly back
> to Oracle, from step 5 to step 7?
> 
> And will writing back to Oracle from Spark perform better than using Sqoop?
> 
> 3) Can I use Spark Scala UDFs to process data from Hive and retain the entire architecture?
> 
> Which of the above would be optimal?
> 
> 
> 
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.naik@gmail.com> wrote:
>> I strongly agree with Jörn and Russell. There are different solutions for data movement,
>> depending on your needs: frequency, bi-directional drivers, workflow, and handling of
>> duplicate records. This space is known as "Change Data Capture", or CDC for short. If you
>> need more information, I would be happy to chat with you. I built some products in this
>> space that used connection pooling over ODBC/JDBC extensively.
>> 
>> Happy to chat if you need more information. 
>> 
>> -Sachin Naik
>> 
>> >> Hard to tell. Can you give more insight into what you are trying to achieve
>> >> and what the data is about?
>> >> For example, depending on your use case, Sqoop can make sense or not.
>> 
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spitzer@gmail.com> wrote:
>>> 
>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases),
>>> skip Sqoop and the Hive tables, and go straight to queries. Then you can skip Hive on the
>>> way back out (see the same link) and write directly to Oracle. I'll leave the performance
>>> questions for someone else.
>>> 
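As a rough illustration of the JDBC route described above, here is a minimal PySpark-style sketch. It assumes an existing SparkSession is passed in as `spark` and that the Oracle JDBC driver jar is on the Spark classpath. The host, service, table, and credential values used with it are hypothetical. The optional partitioning settings only control how many parallel connections Spark opens for the read, and whether that beats Sqoop would have to be measured.

```python
ORACLE_DRIVER = "oracle.jdbc.OracleDriver"


def oracle_url(host, port, service_name):
    """Build an Oracle thin-driver JDBC URL for a service name."""
    return "jdbc:oracle:thin:@//{}:{}/{}".format(host, port, service_name)


def read_options(url, table, user, password, partition_column=None,
                 lower_bound=None, upper_bound=None, num_partitions=None):
    """Collect the options for Spark's JDBC data source."""
    opts = {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": ORACLE_DRIVER,
    }
    # Optional: split the read into parallel range queries on a numeric column.
    if partition_column is not None:
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return opts


def read_table(spark, opts):
    """Load an Oracle table as a DataFrame, without Sqoop or Hive in between."""
    return spark.read.format("jdbc").options(**opts).load()


def write_table(df, url, table, user, password):
    """Append a DataFrame's rows directly to an Oracle table."""
    df.write.jdbc(url=url, table=table, mode="append",
                  properties={"user": user, "password": password,
                              "driver": ORACLE_DRIVER})
```

A pipeline would then call `read_table`, apply its transformations to the returned DataFrame, and pass the result to `write_table`.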
>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8123@gmail.com> wrote:
>>>> 
>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8123@gmail.com> wrote:
>>>> Hi Team,
>>>> 
>>>> Right now our existing flow is
>>>> 
>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination
>>>> Hive table --> Sqoop export to Oracle
>>>> 
>>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>> 
>>>> So now I want to know: if I run native Scala UDFs rather than running Hive
>>>> Java UDFs in Spark SQL, will there be any performance difference?
>>>> 
>>>> 
>>>> Can we skip the Sqoop import and export part and
>>>> instead load data directly from Oracle into Spark, code Scala UDFs for the
>>>> transformations, and export the output data back to Oracle?
>>>> 
>>>> Right now the architecture we are using is
>>>> 
>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL -->
>>>> Hive --> Oracle
>>>> 
>>>> What would be the optimal architecture to process data from Oracle using Spark?
>>>> Can I improve this process in any way?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Sirisha 
>>>> 
> 
