spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Hive From Spark: Jdbc VS sparkContext
Date Sun, 05 Nov 2017 13:11:24 GMT
Hi Nicolas,


Thanks a ton for your kind response. Have you used SparkSession? I think
that HiveContext is a very old way of doing things in Spark, and new
algorithms have been introduced in Spark since then.

Given how kind you have been in sharing your experience, it would be a
great help if you could also share your code and provide details such as
the Spark, Hadoop, and Hive versions and other details of your environment.

After all, no one wants to use Spark 1.x to solve problems anymore, though
I have seen a couple of companies that are stuck with these versions
because they run in-house deployments which they cannot upgrade due to
incompatibility issues.


Regards,
Gourav Sengupta


On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris <niparisco@gmail.com> wrote:

> Hi
>
> After some testing, I have been quite disappointed with the hiveContext
> way of accessing Hive tables.
>
> The main problem is resource allocation: I have tons of users and each of
> them gets a limited subset of workers. This does not allow querying huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
>
> With Hive JDBC, Hive resources are shared by all my users, so queries
> are able to finish.
>
> I have since been testing other JDBC-based approaches, and for now Presto
> looks like the most appropriate solution for accessing Hive tables.
>
> To load huge datasets into Spark, the proposed approach is to use a
> distributed Presto CTAS (CREATE TABLE AS SELECT) to build an ORC dataset,
> and to read that dataset with Spark's DataFrame loader (instead of direct
> JDBC access, which would exhaust the driver's memory).
>
>
>
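[Editor's sketch] The CTAS-then-load approach described above could look roughly like the following. The table names, the ORC path and the helper names are hypothetical, and this is not Nicolas's actual code. It assumes the Presto Hive connector's `format` table property and Spark's standard `spark.read.orc` DataFrame reader.

```python
def build_ctas(target_table, select_sql):
    """Build a Presto CTAS statement that writes the query result as ORC.

    Presto executes the CTAS on its own workers, so the result set never
    passes through the Spark driver. Both arguments are placeholders
    supplied by the caller.
    """
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'ORC') AS {select_sql}"
    )


def load_orc(spark, path):
    """Read the ORC files produced by the CTAS as a Spark DataFrame.

    `spark` is an existing SparkSession; `path` is the (hypothetical)
    storage location of the table created above.
    """
    return spark.read.orc(path)


# Hypothetical table names, for illustration only.
ctas = build_ctas(
    "hive.tmp.big_extract",
    "SELECT * FROM hive.prod.events WHERE year = 2017",
)
print(ctas)
```

The statement would be submitted through any Presto client; Spark then reads the resulting files in parallel on the executors rather than pulling rows over a single JDBC connection.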
> On 15 Oct 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > Without the Hive Thrift server, if you try to run a select * on a table
> > which has around 10,000 partitions, Spark will give you some surprises.
> > Presto works fine in these scenarios, and I am sure the Spark community
> > will soon learn from their algorithms.
> >
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> >
> >     > I do not think that Spark will automatically determine the
> >     > partitions. Actually, it does not automatically determine the
> >     > partitions. If a table has a few million records, it all goes
> >     > through the driver.
> >
> >     Hi Gourav
> >
> >     Actually, the Spark JDBC data source is able to deal directly with
> >     partitions. Spark creates a JDBC connection for each partition.
> >
> >     All the details are explained in this post:
> >     http://www.gatorsmile.io/numpartitionsinjdbc/
> >
> >     There is also an example with the Greenplum database:
> >     http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> >
> >
>
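[Editor's sketch] The "one JDBC connection per partition" behaviour mentioned in the oldest quoted message depends on the reader options `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions`. The pure-Python function below only approximates how those options turn into one WHERE clause per partition. The function name is hypothetical, it is not Spark's actual code, and the exact rounding may differ between Spark versions.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    The bounds only decide the stride; they do not filter rows. The first
    partition also picks up NULLs and everything below lower_bound, and the
    last partition picks up everything above upper_bound.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions == 1:
        return [None]  # a single query with no WHERE clause
    if upper_bound - lower_bound < num_partitions:
        raise ValueError("bounds are too close for this many partitions")

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # No lower limit on the first partition, no upper limit on the last.
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Hypothetical column and bounds, for illustration only.
for predicate in jdbc_partition_predicates("id", 0, 1000, 4):
    print(predicate)
```

Each predicate corresponds to one task and one JDBC connection. Without these options the read is a single query over a single connection, which matches the bottleneck raised earlier in the thread.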
