spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: spark architecture question -- Please Read
Date Sun, 29 Jan 2017 22:22:02 GMT
You can use HDFS, S3, Azure, GlusterFS, Ceph, Ignite (in-memory), etc. A Spark cluster itself
does not store anything; it just processes.

> On 29 Jan 2017, at 15:37, Alex <siri8123@gmail.com> wrote:
> 
> But for persistence after intermediate processing, can I use the Spark cluster itself, or do I have to use a Hadoop cluster?
> 
> On Jan 29, 2017 7:36 PM, "Deepak Sharma" <deepakmca05@gmail.com> wrote:
> The better way is to read the data directly into Spark using Spark SQL's JDBC read.
> Apply the UDFs locally.
> Then save the DataFrame back to Oracle using the DataFrame's JDBC write.
> 
> Thanks
> Deepak
> 
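For illustration, a minimal sketch of the read-JDBC / apply-UDF / write-JDBC flow Deepak describes could look like the following; the connection URL, credentials, table names, and the UDF itself are hypothetical:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("oracle-etl").getOrCreate()
    import spark.implicits._

    // Hypothetical Oracle connection details
    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // 1) Read the source table directly into a DataFrame over JDBC
    val src = spark.read.jdbc(url, "SRC_TABLE", props)

    // 2) Apply the transformation as a native Spark UDF
    val cleanName = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
    val out = src.withColumn("NAME", cleanName($"NAME"))

    // 3) Write the result back to Oracle over JDBC
    out.write.mode("append").jdbc(url, "DST_TABLE", props)

This keeps the whole round trip inside one Spark job, with no intermediate Hive table or Sqoop step.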
>> On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>> One alternative could be the Oracle Hadoop Loader and other Oracle products, but you have to
invest some money and probably buy their Hadoop appliance, so you have to evaluate whether it
makes sense (it can get expensive with large clusters, etc.).
>> 
>> Another alternative would be to get rid of Oracle altogether and use other databases.
>> 
>> However, can you elaborate a little bit on your use case and the business logic, as well as
the SLA requirements? Otherwise, any recommendation could be right, because the requirements you
presented are very generic.
>> 
>> About getting rid of Hadoop - it depends! You will need some resource manager (YARN, Mesos,
Kubernetes, etc.) and most likely also a distributed file system. Spark supports a wide range of
file systems through the Hadoop APIs, but it does not need HDFS for persistence. You can use a
local file system (i.e. any file system mounted on a node, so also distributed ones, such as
ZFS) or cloud file systems (S3, Azure Blob, etc.).
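As a small sketch of how the storage backend is selected purely by the URI scheme, assuming an existing SparkSession named spark and hypothetical bucket and path names:

    // Hadoop-compatible connectors are chosen by URI scheme; no HDFS required.
    val fromS3    = spark.read.parquet("s3a://my-bucket/input/")       // cloud object store
    val fromLocal = spark.read.csv("file:///mnt/shared/input.csv")     // locally mounted file system
    val fromHdfs  = spark.read.parquet("hdfs://namenode:8020/data/")   // HDFS, if you keep it

    fromS3.write.parquet("s3a://my-bucket/output/")

Note that s3a access also needs the hadoop-aws connector and credentials available to the cluster.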
>> 
>> 
>> 
>>> On 29 Jan 2017, at 11:18, Alex <siri8123@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> Thanks for your response. Please find the flow diagram below.
>>> 
>>> Please help me simplify this architecture using Spark.
>>> 
>>> 1) Can I skip steps 1 to 4 and store the data directly in Spark?
>>> If I store it in Spark, where does it actually get stored?
>>> Do I need to retain Hadoop to store the data,
>>> or can I store it directly in Spark and remove Hadoop as well?
>>> 
>>> I want to remove Informatica for preprocessing and directly load the file data coming from
the server into Hadoop/Spark.
>>> 
>>> So my question is: can I directly load file data into Spark? Then where exactly will the
data get stored? Do I need to have Spark installed on top of HDFS?
>>> 
>>> 2) If I am retaining the architecture below, can I store the output from Spark directly
back to Oracle, from step 5 to step 7?
>>> 
>>> And will Spark's way of storing it back to Oracle be better than using Sqoop,
performance-wise?
>>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the entire architecture?
>>> 
>>> Which among the above would be optimal?
>>> 
>>> 
>>> 
>>>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.naik@gmail.com>
wrote:
>>>> I strongly agree with Jörn and Russell. There are different solutions for data movement
depending upon your needs: frequency, bi-directional drivers, workflow, handling duplicate
records. This space is known as "Change Data Capture" (CDC) for short. If you need more
information, I would be happy to chat with you. I built some products in this space that
extensively used connection pooling over ODBC/JDBC.
>>>> 
>>>> Happy to chat if you need more information. 
>>>> 
>>>> -Sachin Naik
>>>> 
>>>> >> Hard to tell. Can you give more insight into what you are trying to achieve and what
the data is about?
>>>> >> For example, depending on your use case, Sqoop can make sense or not.
>>>> Sent from my iPhone
>>>> 
>>>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spitzer@gmail.com>
wrote:
>>>>> 
>>>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop and Hive tables and go straight to queries. Then you can skip Hive on the way
back out (see the same link) and write directly to Oracle. I'll leave the performance questions
for someone else.
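As a hedged sketch of that straight-JDBC path (skipping Sqoop and Hive entirely), something like the following could be used; the connection details, table names, and partitioning column are hypothetical, and the partitioning options simply parallelize the read:

    // Parallel JDBC read from Oracle, split on a numeric column
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "ORDERS")
      .option("user", "scott")
      .option("password", "tiger")
      .option("partitionColumn", "ORDER_ID")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()

    // Run the queries in Spark SQL instead of Hive
    orders.createOrReplaceTempView("orders")
    val summary = spark.sql(
      "SELECT CUSTOMER_ID, SUM(AMOUNT) AS TOTAL FROM orders GROUP BY CUSTOMER_ID")

    // Write the result straight back to Oracle, no Sqoop export
    summary.write
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "ORDER_SUMMARY")
      .option("user", "scott")
      .option("password", "tiger")
      .mode("append")
      .save()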
>>>>> 
>>>>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8123@gmail.com>
wrote:
>>>>>> 
>>>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8123@gmail.com>
wrote:
>>>>>> Hi Team,
>>>>>> 
>>>>>> Right now our existing flow is
>>>>>> 
>>>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) --> destination Hive table --> Sqoop export to Oracle
>>>>>> 
>>>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>>> 
>>>>>> So now I want to know: if I run native Scala UDFs instead of running the Hive Java UDFs
in Spark SQL, will there be any performance difference?
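For reference, a minimal sketch of the two UDF styles being compared, assuming a Hive-enabled SparkSession named spark; the function names and the Hive UDF class are hypothetical:

    // Native Scala UDF registered for use in Spark SQL
    spark.udf.register("normalize_name",
      (s: String) => if (s == null) null else s.trim.toUpperCase)

    // Existing Hive Java UDF registered through HiveQL
    spark.sql("CREATE TEMPORARY FUNCTION normalize_name_hive AS 'com.example.udf.NormalizeName'")

    // Both are then callable from the same Spark SQL query
    spark.sql("SELECT normalize_name(name), normalize_name_hive(name) FROM src").show()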
>>>>>> 
>>>>>> 
>>>>>> Can we skip the Sqoop import and export part, and instead directly load data from Oracle
into Spark, code Scala UDFs for the transformations, and export the output data back to Oracle?
>>>>>> 
>>>>>> Right now the architecture we are using is
>>>>>> 
>>>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL --> Hive --> Oracle
>>>>>> What would be the optimal architecture to process data from Oracle using Spark? Can I
improve this process in any way?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Sirisha 
>>>>>> 
>>> 
> 
