spark-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: Hive From Spark: Jdbc VS sparkContext
Date Sun, 05 Nov 2017 13:27:04 GMT
On 5 Nov 2017 at 14:11, Gourav Sengupta wrote:
> thanks a ton for your kind response. Have you used Spark Session? I think that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK.

I will give sparkSession a try.
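
For reference, a minimal sketch of the SparkSession-based access (Spark 2.x)
that replaces the old hiveContext; the app, database and table names below are
placeholders, not from this thread:

```python
from pyspark.sql import SparkSession

# Spark 2.x unified entry point. enableHiveSupport() connects to the Hive
# metastore, replacing the Spark 1.x HiveContext.
spark = (SparkSession.builder
         .appName("hive-access")        # placeholder app name
         .enableHiveSupport()
         .getOrCreate())

# Query a Hive table directly (placeholder database/table):
df = spark.sql("SELECT * FROM some_db.some_table LIMIT 10")
df.show()
```

This is cluster configuration more than logic: the same builder chain works in
spark-submit jobs and in the pyspark shell (where `spark` is pre-created).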

> It will be a lot of help, given how kind you have been in sharing your
> experience, if you could kindly share your code as well and provide details
> like SPARK, HADOOP, HIVE, and other environment versions.

I am testing an HDP 2.6 distribution with:
SPARK: 2.1.1
HADOOP: 2.7.3
HIVE: 1.2.1000
PRESTO: 1.87

> After all, no one wants to use a SPARK 1.x version to solve problems anymore,
> though I have seen a couple of companies who are stuck with these versions as
> they are using in-house deployments which they cannot upgrade because of
> incompatibility issues.

I didn't know hiveContext was the legacy Spark way. I will give
sparkSession a try and draw my conclusions. After all, I would prefer to
provide our users a single, uniform framework such as Spark, instead of
multiple complicated layers such as Spark + some JDBC access layer.
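
The Presto-CTAS approach quoted below could be sketched like this; all table
names and paths are hypothetical, and the location of the ORC files depends on
the warehouse layout:

```python
# Sketch of the "Presto CTAS -> ORC -> Spark" flow, with placeholder names.
#
# Step 1 (run in Presto; distributed, never through the Spark driver):
#   CREATE TABLE tmp.big_extract
#   WITH (format = 'ORC') AS
#   SELECT * FROM warehouse.events WHERE day >= DATE '2017-01-01';
#
# Step 2 (run in Spark): read the resulting ORC dataset.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-load")           # placeholder app name
         .enableHiveSupport()
         .getOrCreate())

# Either read the ORC files by path (placeholder path) ...
df = spark.read.orc("hdfs:///apps/hive/warehouse/tmp.db/big_extract")
# ... or, since the CTAS registered the table in the metastore:
df = spark.table("tmp.big_extract")
```

Reading the ORC files with Spark's distributed loader keeps the data on the
executors, whereas a plain single-connection JDBC fetch funnels everything
through the driver.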

> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> 
>     Hi
> 
>     After some testing, I have been quite disappointed with the hiveContext
>     way of accessing hive tables.
> 
>     The main problem is resource allocation: I have tons of users and they
>     get a limited subset of workers. This does not allow querying huge
>     datasets, because too little memory is allocated (or maybe I am missing
>     something).
> 
>     When using Hive JDBC, Hive's resources are shared by all my users, and
>     queries are able to finish.
> 
>     I have then been testing other JDBC-based approaches, and for now "presto"
>     looks like the most appropriate solution for accessing hive tables.
> 
>     In order to load huge datasets into Spark, the proposed approach is to
>     use a distributed Presto CTAS to build an ORC dataset, and to access that
>     dataset through Spark's DataFrame loader (instead of direct JDBC access,
>     which would exhaust the driver's memory).
> 
> 
> 
>     On 15 Oct 2017 at 19:24, Gourav Sengupta wrote:
>     > Hi Nicolas,
>     >
>     > without the hive thrift server, if you try to run a select * on a table
>     > which has around 10,000 partitions, SPARK will give you some surprises.
>     > PRESTO works fine in these scenarios, and I am sure the SPARK community
>     > will soon learn from their algorithms.
>     >
>     >
>     > Regards,
>     > Gourav
>     >
>     > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <niparisco@gmail.com> wrote:
>     >
>     >     > I do not think that SPARK will automatically determine the
>     >     > partitions. Actually it does not automatically determine the
>     >     > partitions. In case a table has a few million records, it all
>     >     > goes through the driver.
>     >
>     >     Hi Gourav
>     >
>     >     Actually Spark's JDBC reader is able to deal directly with
>     >     partitions. Spark creates a JDBC connection for each partition.
>     >
>     >     All the details are explained in this post:
>     >     http://www.gatorsmile.io/numpartitionsinjdbc/
>     >
>     >     Also an example with the greenplum database:
>     >     http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
>     >
>     >
> 
> 
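
On the partitioned-JDBC point above (one connection per partition): Spark
splits the [lowerBound, upperBound) range of the partition column into
numPartitions slices and issues one WHERE clause per slice. A simplified
pure-Python illustration of that slicing (a sketch of the idea, not Spark's
actual source code):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Build one WHERE clause per JDBC partition (simplified sketch)."""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            # First slice also catches NULLs and anything below the range.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last slice is open-ended to catch anything above the range.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# Four partitions over ids 0..100 -> four concurrent JDBC reads:
for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

With Spark's own jdbc reader, the same effect is obtained by passing the
`partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, as
described in the gatorsmile post linked above.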
