spark-user mailing list archives

From Michael Segel
Subject Re: Silly question about Yarn client vs Yarn cluster modes...
Date Wed, 22 Jun 2016 14:59:51 GMT

Just to clear a few things up… 

First, I know it's hard to describe some problems because they deal with client-confidential
information. (There are also some basic ‘dead hooker’ thought problems to work through before
facing them at a client.)
The questions I pose here are very general and deal with some basic design issues/considerations.

W.R.T. JDBC / Beeline:
There are many use cases where you don’t want to migrate some or all data to HDFS.  This
is why tools like Apache Drill exist. At the same time… there are different cluster design
patterns.  One such pattern is a storage/compute model where you have multiple clusters acting
either as compute clusters or as storage clusters, with the compute clusters pulling data from
the storage clusters. An example would be spinning up an EMR cluster and running an M/R job
where you read from S3 and output to S3.  Or within your enterprise you have your Data Lake
(Data Sewer) and then a compute cluster for analytics.

In addition, you have some very nasty design issues to deal with like security. Yes, that’s
the very dirty word nobody wants to deal with and in most of these tools, security is an afterthought.
So you may not have direct access to the cluster or an edge node. You only have access to
a single port on a single machine through the firewall, which is running beeline so you can
pull data from your storage cluster. 

It's very possible that you have to pull data from the cluster through beeline to store the data
within a Spark job running on the cluster. (Oh the irony! ;-)

It's important to understand that due to design constraints, options like Sqoop or running
a query directly against Hive may not be possible, and these use cases do exist when dealing
with PII.

It's also important to realize that you may have to pull data from multiple data sources for
some not so obvious but important reasons…
So your Spark app has to be able to generate an RDD from data in an RDBMS (Oracle, DB2, etc.),
persist it for local lookups, then pull data from the cluster, and then output it either
back to the cluster, to another cluster, or someplace else.  All of these design issues occur
when you’re dealing with large enterprises.
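
To make the RDBMS lookup step concrete, here is a minimal sketch in Python of how the option map for Spark's JDBC data source might be put together for a partitioned read. The option keys (url, dbtable, driver, partitionColumn, lowerBound, upperBound, numPartitions, fetchsize) are the ones Spark's JDBC source documents. The helper names, the host, the table and the column are made up for illustration, and load_lookup_table assumes a live SQLContext, HiveContext or SparkSession with the vendor JDBC JAR on the driver and executor classpaths.

```python
def jdbc_read_options(url, table, driver, partition_column,
                      lower_bound, upper_bound, num_partitions,
                      fetch_size=10000):
    """Build the option map for a partitioned Spark JDBC read.

    Hypothetical helper: only the dictionary keys are Spark's own names.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "driver": driver,
        # Spark splits the read into numPartitions range queries on this
        # column; the bounds set the stride, they do not filter rows.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


def load_lookup_table(sql_ctx, options):
    """Read the table over JDBC and mark it for caching.

    Needs a running Spark context, so it is not called in this sketch.
    The range queries are issued by the executors, not the driver.
    """
    df = sql_ctx.read.format("jdbc").options(**options).load()
    return df.persist()


# Placeholder connection details, not taken from the thread.
opts = jdbc_read_options(
    url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
    table="SALES.CUSTOMERS",
    driver="oracle.jdbc.OracleDriver",
    partition_column="CUSTOMER_ID",
    lower_bound=1,
    upper_bound=1000000,
    num_partitions=8,
)
```

With a partition column set, the cached lookup data ends up spread across the executors rather than on the driver, which is the point of the client vs cluster question further down.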

But I digress… 

The reason I started this thread was to get a better handle on where things run when we make
decisions to run either in client or cluster mode in YARN.
Starting and stopping long-running apps is an issue in production, as is tying up cluster
resources while your application remains dormant waiting for the next batch of events to come in.
I'm trying to understand the gotchas of each design consideration so that we can weigh the pros
and cons of a decision when designing a solution…
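
As a rough illustration of the two knobs in question, the sketch below (Python, with a hypothetical helper name and application file) assembles a spark-submit command line for either deploy mode and turns on dynamic allocation so that idle executors can be handed back to YARN. The flags and property names are standard Spark ones. The values are placeholders, and dynamic allocation on YARN also assumes the external shuffle service is running on the NodeManagers.

```python
def spark_submit_args(app, deploy_mode, min_executors=0,
                      max_executors=10, idle_timeout="60s"):
    """Return a spark-submit argument list for YARN (hypothetical helper).

    client:  the driver runs inside the spark-submit process on the
             machine that submits the job (e.g. an edge node).
    cluster: the driver runs inside the YARN ApplicationMaster.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    conf = {
        # Let Spark request and release executors as the load changes.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are released.
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(app)
    return args


# "my_app.py" is a placeholder application name.
client_cmd = spark_submit_args("my_app.py", "client")
cluster_cmd = spark_submit_args("my_app.py", "cluster", max_executors=20)
```

Either list can be handed to subprocess.call or joined into a shell command. The only difference between the two is where the driver lives; the executors run in YARN containers in both modes.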

Spark is relatively new and it's a bit disruptive because it forces you to think in terms of
storage/compute models or Jim Scott’s holistic ‘Zeta Architecture’.  In addition, it
forces you to rethink your cluster hardware design too. 

HTH to clarify the questions and of course thanks for the replies. 


> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh wrote:
> If you are going to get data out of an RDBMS like Oracle then the correct procedure is:
> Use Hive on Spark execution engine. That improves Hive performance
> You can use JDBC through Spark itself. No issue there. It will use JDBC provided by HiveContext
> JDBC is fine. Every vendor has a utility to migrate a full database from one to another
using JDBC. For example SAP relies on JDBC to migrate a whole Oracle schema to SAP ASE
> I have imported an Oracle table of 1 billion rows through Spark into Hive ORC table.
It works fine. Actually I use Spark to do the job for these types of imports. Register a tempTable
from DF and use it to put data into Hive table. You can create Hive table explicitly in Spark
and do an INSERT/SELECT into it rather than save etc.  
> You can access Hive tables through HiveContext. Beeline is a client tool that connects
to Hive thrift server. I don't think it comes into the equation here
> Finally, one experiment is worth multiples of these speculations. Try it for yourself and find out
> If you want to use JDBC for an RDBMS table then you will need to download the relevant
JAR file. For example for Oracle it is ojdbc6.jar etc
> Like anything else, your mileage will vary and you need to experiment with it. Otherwise these
are all opinions.
> Dr Mich Talebzadeh
> On 22 June 2016 at 06:46, Jörn Franke wrote:
> I would import data via sqoop and put it on HDFS. It has some mechanisms to handle the
lack of reliability of JDBC.
> Then you can process the data via Spark. You could also use the JDBC RDD, but I do not recommend
it, because you do not want to pull data out of the database every time you
execute your application. Furthermore, you have to handle connection interruptions and the multiple
serialization/deserialization efforts, and if one executor crashes you have to re-pull some or
all of the data from the database, etc.
> Within the cluster it does not make sense to me to pull data via jdbc from hive. All
the benefits such as data locality, reliability etc would be gone.
> Hive supports different execution engines (TEZ, Spark), formats (Orc, parquet) and further
optimizations to make the analysis fast. It always depends on your use case.
> On 22 Jun 2016, at 05:47, Michael Segel wrote:
>> Sorry, I think you misunderstood. 
>> Spark can read from JDBC sources, so to say that using beeline as a way to access data
is not a Spark application isn’t really true.  Would you say the same if you were pulling
data in to Spark from Oracle or DB2?
>> There are a couple of different design patterns and use cases where data could be
stored in Hive yet your only access method is via a JDBC or Thrift/REST service.  Think also
of compute / storage cluster implementations.
>> WRT #2, not exactly what I meant by exposing the data… and there are limitations
to the Thrift service…
>>> On Jun 21, 2016, at 5:44 PM, ayan guha wrote:
>>> 1. Yes, in the sense you control number of executors from spark application config.

>>> 2. Any IO will be done from executors (never ever on driver, unless you explicitly
call collect()). For example, connection to a DB happens once for each worker (and is used by
local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch
of files were written from various executors. What happens when you want to expose the data
to the world: Spark Thrift Server (STS), which is a long running spark application (ie spark context)
which can serve data from RDDs.
>>> Suppose I have a data source… like a couple of hive tables and I access the
tables via beeline. (JDBC)  -  
>>> This is NOT a spark application, and there is no RDD created. Beeline is just
a jdbc client tool. You use beeline to connect to HS2 or STS. 
>>> In this case… Hive generates a map/reduce job and then would stream the result
set back to the client node where the RDD result set would be built.  -- 
>>> This is never true. When you connect Hive from spark, spark actually reads hive
metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making
spark faster than hive. 
>>> HTH....
>>> Ayan
>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel wrote:
>>> Ok, it's at the end of the day and I’m trying to make sure I understand the
locale of where things are running.
>>> I have an application where I have to query a bunch of sources, creating some
RDDs and then I need to join off the RDDs and some other lookup tables.
>>> Yarn has two modes… client and cluster.
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the workers
are running on the cluster.
>>> When I run a sparkSQL command that generates a new RDD, does the result set live
on the cluster with the workers, and gets referenced by the driver, or does the result set
get migrated to the driver running on the client? (I’m pretty sure I know the answer, but
it's never safe to assume anything…)
>>> The follow up questions:
>>> 1) If I kill the  app running the driver on the edge node… will that cause
YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What
happens and how quickly?
>>> 1a) If using the client mode… can I spin up and spin down the number of executors
on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated
with that executor are gone, however the spark context is still alive on the edge node? [again
assuming that the spark context lives with the driver.])
>>> 2) Any I/O between my spark job and the outside world… (e.g. walking through
the data set and writing out a data set to a file) will occur on the edge node where the driver
is located?  (This may seem kinda silly, but what happens when you want to expose the result
set to the world… ? )
>>> Now for something slightly different…
>>> Suppose I have a data source… like a couple of hive tables and I access the
tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would
stream the result set back to the client node where the RDD result set would be built.  I
realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD
will reside on the client only.  (That is I could in theory run this as a single spark instance.)
>>> If I were to run this on the cluster… then the result set would stream thru
the beeline gateway and would reside back on the cluster sitting in RDDs within each executor?
>>> I realize that these are silly questions but I need to make sure that I know
the flow of the data and where it ultimately resides.  There really is a method to my madness,
and if I could explain it… these questions really would make sense. ;-)
>>> TIA,
>>> -Mike
>>> -- 
>>> Best Regards,
>>> Ayan Guha
