spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: spark session jdbc performance
Date Wed, 25 Oct 2017 06:41:43 GMT
Hi Naveen,

I do not think that it is prudent to use the PK as the partitionColumn.
That is too wide a range of values for any system to handle comfortably.
The numPartitions setting is also interpreted quite differently in the
JDBC case.
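As a side note on how those options interact: lowerBound and upperBound do
not filter rows, they only decide the stride of each partition's range.
Below is a minimal sketch, using the bounds from the query further down in
this thread, that only approximates the range-splitting logic inside
Spark's JDBC source (it is not the exact implementation):

```scala
// Sketch of how Spark derives JDBC partition ranges from lowerBound,
// upperBound and numPartitions. Simplified approximation, not Spark's code.
object PartitionStride {
  // Returns (start, end) id ranges, one per partition.
  def ranges(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
    val stride = (upper - lower) / numPartitions
    (0 until numPartitions).map { i =>
      val start = lower + i * stride
      // Last partition absorbs the remainder so the full range is covered.
      val end   = if (i == numPartitions - 1) upper else start + stride
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // 1 to 500000 over 30 partitions: each scans roughly 16666 ids.
    val rs = ranges(1L, 500000L, 30)
    println(rs.head)  // (1,16667)
    println(rs.last)  // (483315,500000)
  }
}
```

So with 30 partitions the work per task stays small even when the column
is a PK; the PK values themselves are never enumerated one per partition.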

Please keep me updated on how things go.


Regards,
Gourav Sengupta

On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <vmadhire@umail.iu.edu>
wrote:

>
> Hi,
>
>
>
> I am trying to fetch data from Oracle DB using a subquery and experiencing
> lot of performance issues.
>
>
>
> Below is the query I am using,
>
>
>
> Using Spark 2.0.2
>
>
>
> val df = spark_session.read.format("jdbc")
>   .option("driver", "oracle.jdbc.OracleDriver")
>   .option("url", jdbc_url)
>   .option("user", user)
>   .option("password", pwd)
>   .option("dbtable", "subquery")
>   .option("partitionColumn", "id")  // primary key column, uniformly distributed
>   .option("lowerBound", "1")
>   .option("upperBound", "500000")
>   .option("numPartitions", 30)
>   .load()
>
>
>
> The above query is configured to run with 30 partitions, but in the UI I
> see it using only 1 partition to run the query.
>
>
>
> Can anyone tell me if I am missing anything, or whether I need to do
> anything else to tune the performance of the query?
>
> Thanks
>
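For what it is worth, when the partitioning options do take effect, each
partition issues its own query against Oracle with a generated WHERE
clause over the partition column, and the resulting partition count can be
checked with df.rdd.getNumPartitions. The sketch below only illustrates
the general shape of those per-partition predicates; the real clauses come
from Spark's JDBC source internals, not from this code:

```scala
// Illustrative sketch of per-partition WHERE clauses for a partitioned
// JDBC read. Mirrors the general form Spark generates, not its exact code.
object JdbcPredicates {
  def predicates(col: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val stride = (upper - lower) / n
    (0 until n).map { i =>
      val start = lower + i * stride
      val end   = start + stride
      if (i == 0)          s"$col < $end OR $col IS NULL"  // first range also catches NULLs
      else if (i == n - 1) s"$col >= $start"               // last range is open-ended
      else                 s"$col >= $start AND $col < $end"
    }
  }

  def main(args: Array[String]): Unit =
    predicates("id", 1L, 500000L, 30).take(2).foreach(println)
}
```

One practical consequence: the database must serve all 30 range queries
concurrently, so the partition count should be balanced against what the
Oracle side can sustain.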
