spark-user mailing list archives

From Mich Talebzadeh <>
Subject Re: Silly question about Yarn client vs Yarn cluster modes...
Date Wed, 22 Jun 2016 16:36:26 GMT
Thanks Mike for clarification.

I think there is another option: get the data out of the RDBMS with some form
of SELECT of all columns, tab-separated or otherwise delimited, and write it to
a flat file or files. Then scp the file from the RDBMS host to a private
directory on a host in the Hadoop cluster and push it into HDFS. That bypasses
the JDBC reliability problem, and I guess in this case one has more control.
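
The flat-file route described above can be sketched as follows. This is only an illustration under stated assumptions: `sqlite3` from the Python standard library stands in for the real RDBMS driver (Oracle, DB2, etc.), and the function name `export_table_to_tsv` is made up for this example.

```python
import csv
import sqlite3


def export_table_to_tsv(conn, table, out_path):
    """Dump every column of `table` to a tab-separated file and return the row count.

    `conn` is any DB-API connection; sqlite3 is only a stand-in for the real RDBMS.
    """
    cur = conn.cursor()
    # Table names cannot be bound as query parameters, so `table` must come
    # from a trusted source. Embedded double quotes are escaped defensively.
    cur.execute('SELECT * FROM "{}"'.format(table.replace('"', '""')))
    header = [col[0] for col in cur.description]
    rows = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        # Stream row by row so a large table is never held in memory at once.
        for row in cur:
            writer.writerow(row)
            rows += 1
    return rows
```

The resulting file could then be copied to a cluster host with scp and loaded with `hdfs dfs -put`; the target directories are site-specific and are not shown here.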

I do concur that there are security issues with this. For example, that
local file system may need encryption etc., which will make it
tedious. I believe Jorn mentioned this somewhere.


Dr Mich Talebzadeh

On 22 June 2016 at 15:59, Michael Segel <> wrote:

> Hi,
> Just to clear a few things up…
> First I know it's hard to describe some problems because they deal with
> client confidential information.
> (Also some basic ‘dead hooker’ thought problems to work through before
> facing them at a client.)
> The questions I pose here are very general and deal with some basic design
> issues/consideration.
> W.R.T JDBC / Beeline:
> There are many use cases where you don’t want to migrate some or all data
> to HDFS.  This is why tools like Apache Drill exist. At the same time…
> there are different cluster design patterns.  One such pattern is a
> storage/compute model where you have multiple clusters, with compute
> clusters pulling data from storage clusters. An example would be
> spinning up an EMR cluster and running a M/R job where you read from S3 and
> output to S3.  Or within your enterprise you have your Data Lake (Data
> Sewer) and then a compute cluster for analytics.
> In addition, you have some very nasty design issues to deal with like
> security. Yes, that’s the very dirty word nobody wants to deal with and in
> most of these tools, security is an afterthought.
> So you may not have direct access to the cluster or an edge node. You only
> have access to a single port on a single machine through the firewall,
> which is running beeline so you can pull data from your storage cluster.
> It's very possible that you have to pull data from the cluster thru beeline
> to store the data within a spark job running on the cluster. (Oh the irony!
> ;-)
> It's important to understand that due to design constraints, options like
> sqoop or running a query directly against Hive may not be possible and
> these use cases do exist when dealing with PII.
> It's also important to realize that you may have to pull data from multiple
> data sources for some not so obvious but important reasons…
> So your spark app has to be able to generate an RDD from data in an RDBMS
> (Oracle, DB2, etc …), persist it for local lookups and then pull data from
> the cluster and then output it either back to the cluster, another cluster,
> or someplace else.  All of these design issues occur when you’re dealing
> with large enterprises.
> But I digress…
> The reason I started this thread was to get a better handle on where
> things run when we make decisions to run either in client or cluster mode
> in YARN.
> Starting and stopping long-running apps is an issue in production, as is
> tying up cluster resources while your application remains
> dormant waiting for the next batch of events to come through.
> Trying to understand the gotchas of each design consideration so that we
> can weigh the pros and cons of a decision when designing a solution….
> Spark is relatively new and it's a bit disruptive because it forces you to
> think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta
> Architecture’.  In addition, it forces you to rethink your cluster hardware
> design too.
> HTH to clarify the questions and of course thanks for the replies.
> -Mike
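
One setting that bears on the concern above about a dormant application holding cluster resources is Spark's dynamic allocation. The sketch below only assembles the relevant properties as `--conf` arguments for spark-submit. The helper names are invented for this example, and whether this fits a given workload is an assumption; on YARN it also requires the external shuffle service to be set up on the NodeManagers.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout_s=60):
    """Return Spark properties that let idle executors be handed back to YARN."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this become candidates for removal.
        "spark.dynamicAllocation.executorIdleTimeout": "{}s".format(idle_timeout_s),
    }


def to_submit_args(conf):
    """Flatten a property dict into ['--conf', 'key=value', ...] in sorted key order."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args
```

For example, `to_submit_args(dynamic_allocation_conf(0, 20))` yields five `--conf` pairs that can be appended to a spark-submit command line.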
> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <>
> wrote:
> If you are going to get data out of an RDBMS like Oracle then the correct
> procedure is:
>    1. Use Hive on Spark execution engine. That improves Hive performance
>    2. You can use JDBC through Spark itself. No issue there. It will use
>    JDBC provided by HiveContext
>    3. JDBC is fine. Every vendor has a utility to migrate a full database
>    from one to another using JDBC. For example SAP relies on JDBC to migrate a
>    whole Oracle schema to SAP ASE
>    4. I have imported an Oracle table of 1 billion rows through Spark
>    into Hive ORC table. It works fine. Actually I use Spark to do the job for
>    this type of import. Register a tempTable from the DF and use it to put data
>    into Hive table. You can create Hive table explicitly in Spark and do an
>    INSERT/SELECT into it rather than save etc.
>    5. You can access Hive tables through HiveContext. Beeline is a client
>    tool that connects to the Hive Thrift server. I don't think it comes into
>    the equation here
>    6. Finally, one experiment is worth more than any amount of speculation. Try
>    it for yourself and find out.
>    7. If you want to use JDBC for an RDBMS table then you will need to
>    download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
> Like anything else, your mileage may vary and you need to experiment with it.
> Otherwise these are all opinions.
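
As a rough illustration of reading an RDBMS table through Spark's JDBC source, as in the list above, the sketch below splits a numeric key range into WHERE clauses and passes them as `predicates` to `DataFrameReader.jdbc`, so that each slice becomes its own partition. The helper names are hypothetical, `session` is assumed to be an existing SQLContext, HiveContext or SparkSession, and the vendor JDBC JAR still has to be on the classpath.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into at most `num_partitions` contiguous WHERE clauses.

    Rows whose key falls outside [lower, upper) are NOT read.
    """
    if num_partitions < 1 or upper <= lower:
        raise ValueError("need num_partitions >= 1 and upper > lower")
    stride = -(-(upper - lower) // num_partitions)  # ceiling division
    predicates = []
    start = lower
    while start < upper:
        end = min(start + stride, upper)
        predicates.append("{c} >= {s} AND {c} < {e}".format(c=column, s=start, e=end))
        start = end
    return predicates


def read_table_in_slices(session, url, table, column, lower, upper,
                         num_partitions, properties):
    """Read `table` over JDBC with one partition per generated predicate."""
    preds = range_predicates(column, lower, upper, num_partitions)
    return session.read.jdbc(url=url, table=table,
                             predicates=preds, properties=properties)
```

The generated clauses are plain strings, so they can be checked against the source database before any Spark job is launched.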
> Dr Mich Talebzadeh
> On 22 June 2016 at 06:46, Jörn Franke <> wrote:
>> I would import data via sqoop and put it on HDFS. It has some mechanisms
>> to handle the lack of reliability of JDBC.
>> Then you can process the data via Spark. You could also use the JDBC RDD, but
>> I do not recommend it, because you do not want to pull data out of the
>> database every time you execute your application. Furthermore,
>> you have to handle connection interruptions and multiple
>> serialization/deserialization steps, and if one executor crashes you have to
>> re-pull some or all of the data from the database, etc.
>> Within the cluster it does not make sense to me to pull data via jdbc
>> from hive. All the benefits such as data locality, reliability etc would be
>> gone.
>> Hive supports different execution engines (TEZ, Spark), formats (Orc,
>> parquet) and further optimizations to make the analysis fast. It always
>> depends on your use case.
>> On 22 Jun 2016, at 05:47, Michael Segel <>
>> wrote:
>> Sorry, I think you misunderstood.
>> Spark can read from JDBC sources, so saying that using beeline as a way to
>> access data is not a Spark application isn’t really true.  Would you say
>> the same if you were pulling data into Spark from Oracle or DB2?
>> There are a couple of different design patterns and use cases where data
>> could be stored in Hive yet your only access method is via a JDBC or
>> Thrift/REST service.  Think also of compute / storage cluster
>> implementations.
>> WRT #2, that's not exactly what I meant by exposing the data… and there are
>> limitations to the Thrift service…
>> On Jun 21, 2016, at 5:44 PM, ayan guha <> wrote:
>> 1. Yes, in the sense you control number of executors from spark
>> application config.
>> 2. Any IO will be done from executors (never ever on driver, unless you
>> explicitly call collect()). For example, a connection to a DB is opened once for
>> each worker (and used by the local executors). Also, if you run a reduceByKey
>> job and write to hdfs, you will find a bunch of files were written from
>> various executors. What happens when you want to expose the data to world:
>> Spark Thrift Server (STS), which is a long running spark application (ie
>> spark context) which can serve data from RDDs.
>> Suppose I have a data source… like a couple of hive tables and I access
>> the tables via beeline. (JDBC)  -
>> This is NOT a spark application, and there is no RDD created. Beeline is
>> just a jdbc client tool. You use beeline to connect to HS2 or STS.
>> In this case… Hive generates a map/reduce job and then would stream the
>> result set back to the client node where the RDD result set would be built.
>>  --
>> This is never true. When you connect to Hive from Spark, Spark actually
>> reads the Hive metastore and streams data directly from HDFS. Hive MR jobs do
>> not play any role here, making Spark faster than Hive.
>> HTH....
>> Ayan
>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <
>> > wrote:
>>> Ok, it's at the end of the day and I’m trying to make sure I understand
>>> the locale of where things are running.
>>> I have an application where I have to query a bunch of sources, creating
>>> some RDDs and then I need to join off the RDDs and some other lookup tables.
>>> Yarn has two modes… client and cluster.
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the
>>> workers are running on the cluster.
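
For reference, the two YARN modes mentioned here differ only in the `--deploy-mode` flag passed to spark-submit: with `client` the driver runs in the spark-submit process on the submitting host (the edge node), and with `cluster` it runs inside the YARN ApplicationMaster. The sketch below just builds the command line; the helper name and the application file `my_app.py` are placeholders, not anything from this thread.

```python
def spark_submit_argv(app, deploy_mode, app_args=()):
    """Build a spark-submit command line for YARN in the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", "yarn",
            "--deploy-mode", deploy_mode, app] + list(app_args)


# Driver on the edge node:
client_cmd = spark_submit_argv("my_app.py", "client")
# Driver inside the cluster:
cluster_cmd = spark_submit_argv("my_app.py", "cluster")
```

Either list can be handed to `subprocess.run` on a host where spark-submit is installed.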
>>> When I run a sparkSQL command that generates a new RDD, does the result
>>> set live on the cluster with the workers, and gets referenced by the
>>> driver, or does the result set get migrated to the driver running on the
>>> client? (I’m pretty sure I know the answer, but it's never safe to assume
>>> anything…)
>>> The follow up questions:
>>> 1) If I kill the  app running the driver on the edge node… will that
>>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>>> doesn’t happen) What happens and how quickly?
>>> 1a) If using the client mode… can I spin up and spin down the number of
>>> executors on the cluster? (Assuming that when I kill an executor any
>>> portion of the RDDs associated with that executor are gone, however the
>>> spark context is still alive on the edge node? [again assuming that the
>>> spark context lives with the driver.])
>>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>>> through the data set and writing out a data set to a file) will occur on
>>> the edge node where the driver is located?  (This may seem kinda silly, but
>>> what happens when you want to expose the result set to the world… ? )
>>> Now for something slightly different…
>>> Suppose I have a data source… like a couple of hive tables and I access
>>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>>> job and then would stream the result set back to the client node where the
>>> RDD result set would be built.  I realize that I could run Hive on top of
>>> spark, but that’s a separate issue. Here the RDD will reside on the client
>>> only.  (That is I could in theory run this as a single spark instance.)
>>> If I were to run this on the cluster… then the result set would stream
>>> thru the beeline gateway and would reside back on the cluster sitting in
>>> RDDs within each executor?
>>> I realize that these are silly questions but I need to make sure that I
>>> know the flow of the data and where it ultimately resides.  There really is
>>> a method to my madness, and if I could explain it… these questions really
>>> would make sense. ;-)
>>> TIA,
>>> -Mike
>> --
>> Best Regards,
>> Ayan Guha