spark-user mailing list archives

From Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
Subject Re: parallel processing with JDBC
Date Mon, 15 Aug 2016 07:07:40 GMT
Hi Mich,

I have a question below.

I want to join two tables and return the result based on an input value.
In that case, how do we specify the lowerBound and upperBound values?

select t1.id, t1.name, t2.course, t2.qualification
from t1, t2
where t1.transactionid = 11111 and t1.id = t2.id

(11111 is a dynamic input value.)
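One way to handle this (a sketch only, not from this thread): push the whole join down as the dbtable subquery and keep the join key as the partition column. The Spark-side names below (HiveContext, _ORACLEserver, _username, _password) follow the example later in the thread; minId/maxId are hypothetical values you would fetch first with a min/max query on the join key for the given transactionid.

```scala
// Sketch: build the join as a JDBC subquery for a dynamic input value.
// The builder itself is plain string work; the Spark usage below it is
// illustrative only (all names there are placeholders).
object JoinPushdown {
  def joinSubquery(txn: Long): String =
    s"(select t1.id, t1.name, t2.course, t2.qualification " +
      s"from t1, t2 where t1.transactionid = $txn and t1.id = t2.id)"
}

// Illustrative Spark usage (not runnable on its own):
//   val d = HiveContext.read.format("jdbc").options(
//     Map("url" -> _ORACLEserver,
//       "dbtable" -> JoinPushdown.joinSubquery(11111),
//       "partitionColumn" -> "ID",      // the join key
//       "lowerBound" -> minId.toString, // from: select min(id), max(id)
//       "upperBound" -> maxId.toString, //   from t1 where transactionid = 11111
//       "numPartitions" -> "8",
//       "user" -> _username,
//       "password" -> _password)).load
```

The bounds describe the range of the partition column within the filtered result set, so they have to be recomputed per input value.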

Regards,
Rajesh

On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com
> wrote:

> If you have your RDBMS table partitioned, then you need to consider how
> much data you want to extract, in other words the result set returned by
> the JDBC call.
>
> If you want all the data, then the number of partitions specified in the
> JDBC call should be equal to the number of partitions in your RDBMS table.
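As a sketch of that approach (not from this thread): Spark's DataFrameReader.jdbc also has an overload taking an explicit Array[String] of predicates, one per resulting Spark partition, which lets you line the read up with the RDBMS range boundaries exactly. The boundary values and table name below are hypothetical; for Oracle range partitions you would take the real boundaries from DBA_TAB_PARTITIONS.

```scala
// Sketch: build one WHERE predicate per RDBMS range partition so the JDBC
// read produces exactly one Spark partition per table partition.
// Boundary values are hypothetical placeholders.
object PartitionPredicates {
  def rangePredicates(col: String, cuts: Seq[Long]): Seq[String] =
    cuts.sliding(2).map { case Seq(lo, hi) => s"$col >= $lo AND $col < $hi" }.toSeq
}

// Illustrative Spark usage (placeholder names, not runnable on its own):
//   import java.util.Properties
//   val preds = PartitionPredicates.rangePredicates("ID",
//     Seq(0L, 1000000L, 2000000L, 3000000L))
//   val df = sqlContext.read.jdbc(_ORACLEserver, "MYTABLE",
//     preds.toArray, new Properties())
```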
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 August 2016 at 21:44, Ashok Kumar <ashok34668@yahoo.com> wrote:
>
>> Thank you very much sir.
>>
>> I forgot to mention that two of these Oracle tables are range
>> partitioned. In that case, what would be the optimum number of
>> partitions, if you can share?
>>
>> Warmest
>>
>>
>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>
>> If you have primary keys on these tables then you can parallelise the
>> process reading data.
>>
>> You have to be careful not to set the number of partitions too high.
>> There is a balance between the number of partitions supplied to JDBC and
>> the load on the network and the source DB.
>>
>> Assuming that your underlying table has a primary key column ID, the
>> following will create 20 parallel connections to the Oracle DB:
>>
>>  val d = HiveContext.read.format("jdbc").options(
>>    Map("url" -> _ORACLEserver,
>>      "dbtable" -> "(SELECT <COL1>, <COL2>, ... FROM <TABLE>)",
>>      "partitionColumn" -> "ID",
>>      "lowerBound" -> "1",
>>      "upperBound" -> maxID.toString,
>>      "numPartitions" -> "20",
>>      "user" -> _username,
>>      "password" -> _password)).load
>>
>> assuming maxID is a variable holding the maximum value of ID (upperBound
>> must be a numeric string, so pass the value rather than the text "maxID")
>>
>>
>> This will open multiple connections to RDBMS, each getting a subset of
>> data that you want.
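For intuition, here is a simplified sketch (not Spark's exact code, which also handles nulls and uneven strides) of how lowerBound/upperBound/numPartitions become one WHERE clause per partition. Note the bounds only control the split; the first and last partitions are left open so rows outside the bounds are still read.

```scala
// Simplified sketch of how the JDBC source turns the partitioning options
// into per-partition WHERE clauses; this only shows the shape of the split.
object StrideSketch {
  def whereClauses(col: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    val stride = (upper - lower) / n
    (0 until n).map { i =>
      val lo = lower + i * stride
      val hi = lo + stride
      if (i == 0) s"$col < $hi"            // first partition: open below
      else if (i == n - 1) s"$col >= $lo"  // last partition: open above
      else s"$col >= $lo AND $col < $hi"
    }
  }
}
```

With lower = 1, upper = 100 and 4 partitions the stride is 24, so the clauses come out as ID < 25, ID >= 25 AND ID < 49, ID >= 49 AND ID < 73, and ID >= 73; each runs as its own query over its own connection.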
>>
>> You need to test this to find the optimum numPartitions and to make sure
>> you don't overload any component.
>>
>> HTH
>>
>>
>>
>>
>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34668@yahoo.com.invalid>
>> wrote:
>>
>> Hi,
>>
>> There are 4 tables ranging from 10 million to 100 million rows but they
>> all have primary keys.
>>
>> The network is fine but our Oracle is RAC and we can only connect to a
>> designated Oracle node (where we have a DQ account only).
>>
>> We have a limited time window of a few hours to get the required data out.
>>
>> Thanks
>>
>>
>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>
>> How big are your tables, and are there any network issues between your
>> Spark nodes and your Oracle DB that could slow things down?
>>
>> HTH
>>
>>
>>
>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34668@yahoo.com.invalid>
>> wrote:
>>
>> Hi Gurus,
>>
>> I have a few large tables in an RDBMS (ours is Oracle). We want to access
>> these tables through Spark JDBC.
>>
>> What is the quickest way of getting the data into a Spark DataFrame, say
>> using multiple connections from Spark?
>>
>> thanking you
>>
