spark-user mailing list archives

From "@Sanjiv Singh" <sanjiv.is...@gmail.com>
Subject Re: Spark SQL Parallelism - While reading from Oracle
Date Wed, 10 Aug 2016 15:28:45 GMT
Use the JDBC partitioning options.
You can set all of the properties (driver, partitionColumn, lowerBound,
upperBound, numPartitions); start with the driver.

Once you have the maximum id of the table, you can use it for the
upperBound parameter. Choose numPartitions based on the size of your table
and the resources of the system you are running on. With this snippet you
read a database table into a DataFrame with Spark, using one task per
partition.

df = sqlContext.read.format('jdbc').options(
        url="jdbc:mysql://ip-address:3306/sometable?user=username&password=password",
        dbtable="sometable",
        driver="com.mysql.jdbc.Driver",
        partitionColumn="id",    # numeric column used to split the read
        lowerBound=1,
        upperBound=maxId,        # maximum value of the id column
        numPartitions=100        # number of parallel JDBC reads
        ).load()
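
To make it clearer what these four options do, here is a rough sketch in
plain Python of how the id range is cut into one WHERE clause per partition.
It is a simplified illustration, not Spark's actual code; the function name
partition_predicates is made up for this example, and the stride arithmetic
is my understanding of the behaviour at the time of writing. Note that
lowerBound and upperBound only decide the stride; they do not filter rows,
so ids outside the range still land in the first or last partition.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition (hypothetical helper)."""
    if num_partitions < 2:
        # No WHERE clause: the whole table is read by a single task.
        return [None]

    # Width of each id range, using integer division.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        first = i == 0
        last = i == num_partitions - 1
        lower = None if first else f"{column} >= {current}"
        current += stride
        upper = None if last else f"{column} < {current}"

        if first:
            # The first partition is open at the bottom and also takes NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif last:
            # The last partition is open at the top.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for clause in partition_predicates("id", 1, 100, 4):
    print(clause)
```

If the ids are heavily skewed, the ranges will hold very different numbers
of rows, so some tasks will run much longer than others.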



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Wed, Aug 10, 2016 at 6:35 AM, Siva A <siva9940261121@gmail.com> wrote:

> Hi Team,
>
> How do we increase the parallelism in Spark SQL?
> In Spark Core, we can repartition or pass extra arguments as part of the
> transformation.
>
> I am trying the below example,
>
> val df1 = sqlContext.read.format("jdbc").options(Map(...)).load
> val df2 = df1.cache
> df2.count
>
> Here the count operation uses only one task. I couldn't increase the
> parallelism.
> Thanks in advance
>
> Thanks
> Siva
>
