spark-user mailing list archives

From Kabeer Ahmed <kab...@gmx.co.uk>
Subject Re: Hive From Spark: Jdbc VS sparkContext
Date Fri, 13 Oct 2017 10:02:12 GMT
My take on this might sound a bit different. Here are a few points to consider:

1. Going through Hive JDBC means the application is limited by how fast queries can be
compiled. HS2 can only compile one SQL statement at a time, and if users submit bad SQL it can
take a long time just to compile (before any MapReduce even starts). This reduces query
throughput, i.e. the number of queries you can fire through JDBC.
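
To make the JDBC route concrete, here is a minimal sketch (in Scala) of what I mean by going
through HS2, assuming the standard org.apache.hive.jdbc.HiveDriver is on the classpath; the
host, credentials and query are placeholders only:

import java.sql.DriverManager

// Plain Hive JDBC: every statement has to be compiled by HiveServer2 before it
// runs, so HS2's one-statement-at-a-time compilation is the bottleneck.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "user", "")
val stmt = conn.createStatement()
// HS2 parses and compiles the SQL first (no MapReduce yet), then executes it.
val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()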

2. Going through Hive JDBC does have the advantage that the HMS service is protected. The JIRA
https://issues.apache.org/jira/browse/HIVE-13884 protects HMS from crashing, because at the
end of the day retrieving metadata for a Hive table that has millions, or simply put 1000s,
of partitions hits the JVM limit on the size of the array holding the retrieved metadata.
When that limit is hit, HMS crashes. So in effect this guard is good to have to protect HMS
and the relational database on its back end.

Note: the Hive community has proposed moving the metastore database to HBase, which would
scale better, but I don't think this will be implemented any time soon.
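
If I recall correctly, the guard HIVE-13884 adds is the hive.metastore.limit.partition.request
property, which caps how many partitions a single metastore request may return. A rough sketch
of setting it (the property name is from memory, the value is made up, and in a remote-metastore
deployment it really belongs in the metastore's hive-site.xml rather than in client code):

import org.apache.hadoop.hive.conf.HiveConf

// Assumption: hive.metastore.limit.partition.request (default -1 = unlimited)
// makes HMS reject requests asking for more partitions than the cap, instead
// of letting them blow up the HMS heap.
val conf = new HiveConf()
conf.set("hive.metastore.limit.partition.request", "100000")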

3. Going through the SparkContext, Spark interfaces directly with the Hive MetaStore. I have
tried to lay out the code flow below. The bit I didn't have time to dive into: I believe that
if the table is really large, i.e. it has more than 32K partitions (the range of a short),
then some sort of slicing occurs (I didn't manage to find that piece of code, but from
experience this does seem to happen).

Code flow:
Spark uses HiveExternalCatalog -> goo.gl/7CZcDw
HiveClient's getPartitions -> goo.gl/ZAEsqQ
HiveClientImpl's getPartitions -> goo.gl/msPrr5
The Hive call is made at -> goo.gl/TB4NFU
ThriftHiveMetastore.java -> get_partitions_ps_with_auth

A value of -1 is sent from Spark all the way through to the Hive Metastore thrift call, so in
effect, for large tables, 32K partitions are retrieved at a time. This has also led to a few
HMS crashes, but I have yet to confirm whether this is really the cause.
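
To illustrate the short-typed max_parts parameter that the -1 travels through, here is a rough
sketch against the raw metastore client (the database and table names are placeholders, and the
exact overload may differ between Hive versions):

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val client = new HiveMetaStoreClient(new HiveConf())
// max_parts is a Java short: -1 conventionally means "all", any other value caps
// the result. Because the type is a short, ~32K (Short.MaxValue) is the largest
// explicit cap, which is the 32K ceiling mentioned above.
val all   = client.listPartitions("default", "big_table", (-1).toShort)
val first = client.listPartitions("default", "big_table", 1000.toShort)
println(s"retrieved ${all.size()} partitions, first batch ${first.size()}")
client.close()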


Based on the three points above, I would prefer to use the SparkContext route. If the crashes
do turn out to be caused by retrieving a high number of partitions, then I may opt for the
JDBC route instead.
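
For completeness, the SparkContext route I am referring to is just the usual Hive-enabled
session (what "going through the SparkContext" amounts to in Spark 2.x); the table name is a
placeholder:

import org.apache.spark.sql.SparkSession

// Hive-enabled Spark session: table metadata comes straight from HMS via
// HiveExternalCatalog, and the data is read by the executors, bypassing
// HiveServer2 entirely.
val spark = SparkSession.builder()
  .appName("hive-via-sparkcontext")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT count(*) FROM some_table")
df.show()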

Thanks
Kabeer.


On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
>> In case a table has a few
>> million records, it all goes through the driver.
>
> This sounds clear in JDBC mode: the driver gets all the rows and then
> spreads the RDD over the executors.
>
> I'd say that most use cases use SQL to aggregate huge datasets and
> retrieve a small number of rows to then be transformed for ML tasks.
> In that case JDBC offers the robustness of Hive to produce a small aggregated
> dataset for Spark, while Spark SQL uses RDDs to produce the small
> dataset from the huge one.
>
> It is not very clear how Spark SQL deals with a huge Hive table. Does it load
> everything into memory and crash, or does that never happen?


-- 
Sent using Dekko from my Ubuntu device


