spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Spark_JDBC_Partitions
Date Sat, 10 Sep 2016 15:37:35 GMT
It is strange that an Oracle table of 200 million plus rows has not been partitioned.

What matters here is to have parallel connections from JDBC to Oracle, each
reading a sub-set of the table. Any parallel fetch is going to be better than
reading with one connection from Oracle.

Surely among the 404 columns there must be one with high enough cardinality
to serve as the partition column.
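
If such a column exists, a minimal sketch of the parallel read would look like
the following. Everything here is hypothetical and not from this thread: the
connection string, credentials, table name, the numeric column ID_COL, and the
bounds (taken from the row count mentioned below):

import java.util.Properties

val connProps = new Properties()
connProps.put("user", "app_user")       // hypothetical credentials
connProps.put("password", "app_pass")
connProps.put("driver", "oracle.jdbc.OracleDriver")

// url, table, partition column, lowerBound, upperBound, numPartitions
val df = spark.read.jdbc(
  "jdbc:oracle:thin:@//dbhost:1521/ORCL",   // hypothetical connection string
  "SCHEMA.BIG_TABLE",                       // hypothetical table name
  "ID_COL",                                 // hypothetical high-cardinality numeric column
  1L,                                       // lowerBound, roughly min(ID_COL)
  200800000L,                               // upperBound, roughly max(ID_COL)
  104,                                      // numPartitions
  connProps)

df.write.parquet("hdfs:///tmp/big_table")   // hypothetical HDFS target

Spark turns this into numPartitions range queries on ID_COL and opens one
Oracle session per partition, so the ranges are only as balanced as the
values in that column.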

Maybe you should just create a smaller copy for testing, something like:
create table <small> as select * from Oracle_table where rownum <= 1000000;
and use that for the test run.

Another alternative is to use Oracle SQL Connector for HDFS
<https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>, which
can do the unload for you. With 404 columns it is difficult to suggest much
else. Is this a FACT table?

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 September 2016 at 16:20, Ajay Chander <itschevva@gmail.com> wrote:

> Hello Everyone,
>
> My goal is to use Spark Sql to load huge amount of data from Oracle to
> HDFS.
>
> *Table in Oracle:*
> 1) no primary key.
> 2) Has 404 columns.
> 3) Has 200,800,000 rows.
>
> *Spark SQL:*
> In my Spark SQL job I want to read the data into n partitions in
> parallel, for which I need to provide 'partitionColumn', 'lowerBound',
> 'upperBound', and 'numPartitions' for the Oracle table. My table in Oracle
> has no such column to satisfy this need (it is highly skewed); because of
> that, if numPartitions is set to 104, 102 tasks finish in a minute, one
> task finishes in 20 minutes, and the last one takes forever.
>
> Is there anything I could do to distribute the data evenly across the
> partitions? Can we use a synthetic query to orchestrate this pull, as
> we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS
> MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
>
> Any pointers are appreciated.
>
> Thanks for your time.
>
> ~ Ajay
>
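
For the modulo-style split asked about above, one option not discussed
elsewhere in this thread is to skip the partition column entirely and hand
Spark an explicit list of predicates, one per partition; hashing ROWID with
ORA_HASH spreads the rows evenly even when every real column is skewed. This
is only a sketch, and all names, credentials, and the connection string are
hypothetical:

import java.util.Properties

val connProps = new Properties()
connProps.put("user", "app_user")       // hypothetical credentials
connProps.put("password", "app_pass")
connProps.put("driver", "oracle.jdbc.OracleDriver")

// One predicate per partition; each becomes its own Oracle session.
// MOD(ORA_HASH(ROWID), n) buckets rows evenly regardless of column skew.
val buckets = 12
val predicates = (0 until buckets)
  .map(i => s"MOD(ORA_HASH(ROWID), $buckets) = $i")
  .toArray

val df = spark.read.jdbc(
  "jdbc:oracle:thin:@//dbhost:1521/ORCL",   // hypothetical connection string
  "SCHEMA.BIG_TABLE",                       // hypothetical table name
  predicates,
  connProps)

This plays roughly the same role as Sqoop's --boundary-query trick: the split
is driven by a synthetic expression rather than by any real column in the
table.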
