spark-user mailing list archives

From: Takeshi Yamamuro <linguin....@gmail.com>
Subject: Re: Spark_JDBC_Partitions
Date: Sat, 10 Sep 2016 15:42:24 GMT
Hi,

Yea, Spark does not have the same functionality as Sqoop.
I think one simple solution is to assign unique IDs to the Oracle table
yourself.
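
For example, if you add a sequence-backed numeric id on the Oracle side,
you can then use it as the partition column. A rough, untested sketch (the
url, table, and column names here are made up):

  // sketch: read in 104 parallel slices over the id you assigned
  val df = spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // made-up URL
    .option("user", "scott")
    .option("password", "tiger")
    .option("dbtable", "big_table")
    .option("partitionColumn", "id")      // the unique id you assigned
    .option("lowerBound", "1")
    .option("upperBound", "200800000")
    .option("numPartitions", "104")
    .load()
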
Thoughts?

// maropu


On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Strange that an Oracle table of 200 million plus rows has not been partitioned.
>
> What matters here is to have parallel connections from JDBC to Oracle,
> each reading a subset of the table. Any parallel fetch is going to be
> better than reading with one connection from Oracle.
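>
> For example, you can pass one predicate per partition so that each
> connection reads its own slice. A rough, untested sketch (ORA_HASH buckets
> rows by ROWID; the url and table name are made up):
>
>   import java.util.Properties
>
>   val url = "jdbc:oracle:thin:@//dbhost:1521/service"   // made-up URL
>   val props = new Properties()
>   props.setProperty("user", "scott")
>   props.setProperty("password", "tiger")
>
>   // one predicate per partition -> one JDBC connection per slice
>   val n = 104
>   val predicates =
>     (0 until n).map(i => s"ORA_HASH(ROWID, ${n - 1}) = $i").toArray
>
>   val df = spark.read.jdbc(url, "big_table", predicates, props)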
>
> Surely among the 404 columns there must be one with high enough
> cardinality to serve this purpose.
>
> Maybe you should just create table <small> as select * from Oracle_table
> where rownum <= 1000000; and use that for testing.
>
> Another alternative is to use the Oracle SQL Connector for HDFS
> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>,
> which can do it for you. With 404 columns it is difficult to suggest any
> alternative. Is this a FACT table?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 September 2016 at 16:20, Ajay Chander <itschevva@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> My goal is to use Spark SQL to load a huge amount of data from Oracle
>> into HDFS.
>>
>> *Table in Oracle:*
>> 1) No primary key.
>> 2) Has 404 columns.
>> 3) Has 200,800,000 rows.
>>
>> *Spark SQL:*
>> In Spark SQL I want to read the data into n partitions in parallel, for
>> which I need to provide 'partitionColumn', 'lowerBound', 'upperBound', and
>> 'numPartitions' for the Oracle table. My table in Oracle has no column
>> that satisfies this need (the candidates are highly skewed); because of
>> that, if numPartitions is set to 104, 102 tasks finish in a minute, one
>> task finishes in 20 minutes, and the last one takes forever.
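>>
>> (As far as I understand, Spark just splits the bound range into equal
>> strides, blind to the actual value distribution, roughly like this:)
>>
>>   // simplified view of how the 104 partition queries are carved up
>>   val (lower, upper, n) = (0L, 200800000L, 104)
>>   val stride = (upper - lower) / n   // ~1.93M values per task
>>   // task i runs: WHERE col >= lower + i*stride AND col < lower + (i+1)*stride
>>   // with skewed data, nearly all rows land in a handful of strides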
>>
>> Is there anything I can do to distribute the data evenly across
>> partitions? Can we set a fake boundary query to orchestrate this pull
>> process, as we do in Sqoop, like this: '--boundary-query "SELECT CAST(0 AS
>> NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
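>>
>> For example, could a synthetic MOD column in the dbtable subquery play
>> the same role? A rough, untested sketch of what I mean (the url and table
>> name are made up):
>>
>>   val url = "jdbc:oracle:thin:@//dbhost:1521/service"  // made-up URL
>>   // MOD(ORA_HASH(ROWID), 104) spreads rows evenly over buckets 0..103
>>   val q = "(SELECT t.*, MOD(ORA_HASH(ROWID), 104) AS split_col FROM big_table t) tmp"
>>   val df = spark.read.format("jdbc")
>>     .option("url", url)
>>     .option("dbtable", q)
>>     .option("partitionColumn", "split_col")
>>     .option("lowerBound", "0")      // stride of 1 -> one bucket per task
>>     .option("upperBound", "104")
>>     .option("numPartitions", "104")
>>     .load()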
>>
>> Any pointers are appreciated.
>>
>> Thanks for your time.
>>
>> ~ Ajay
>>
>
>


-- 
---
Takeshi Yamamuro
