spark-user mailing list archives

From Suresh Thalamati <suresh.thalam...@gmail.com>
Subject Re: Spark_JDBC_Partitions
Date Tue, 13 Sep 2016 18:05:31 GMT
There is also another jdbc method in the DataFrame reader API to specify your own predicates for each partition. Using this you can control what is included in each partition.

import java.util.Properties

// Each predicate becomes the WHERE clause of exactly one partition.
val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
  urlWithUserAndPass,
  "TEST.PEOPLE",
  predicates = jdbcPartitionWhereClause,
  connectionProperties = new Properties())
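
Each predicate maps to exactly one output partition (df.rdd.getNumPartitions above would be 2). For a skewed table with no usable column, one option is to bucket on a hash of ROWID so the partitions come out roughly even. A minimal sketch, where oracleUrl and BIG_TABLE are placeholder names, not anything from this thread:

import java.util.Properties

// ORA_HASH is an ordinary function, so unlike a window function it is
// legal inside a WHERE clause; each bucket becomes one predicate and
// therefore one Spark partition.
val n = 104
val hashPredicates = (0 until n).map(i => s"mod(ora_hash(rowid), $n) = $i").toArray
val dfEven = spark.read.jdbc(oracleUrl, "BIG_TABLE", hashPredicates, new Properties())
// dfEven.rdd.getNumPartitions == 104, one per predicate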


Hope that helps. 
-suresh


> On Sep 13, 2016, at 9:44 AM, Rabin Banerjee <dev.rabin.banerjee@gmail.com> wrote:
> 
> Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, known as OraOop. Spark cannot do everything;
> you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch job to process it.
> 
> Hope it helps !!
> 
> On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.racic@gmail.com> wrote:
> Hi, 
> 
> One way can be to use the NTILE function to partition the data.
> Example:
> 
> REM Creating test table
> create table Test_part as select * from ( select rownum rn from all_tables t1 ) where rn <= 1000;
> 
> REM Partition lines by Oracle block number, 11 partitions in this example. 
> select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part;
> 
> 
> Let's see distribution: 
> 
> select nt, count(*) from ( select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part ) group by nt;
> 
>         NT   COUNT(*)
> ---------- ----------
>          1         10
>          6         10
>         11          9
>          2         10
>          4         10
>          5         10
>          8         10
>          3         10
>          7         10
>          9          9
>         10          9
> 
> 11 rows selected.
> ^^ It looks good. Sure, feel free to choose any other ordering condition, whichever best suits your case.
> 
> So you can:
> 1) have one session reading and then decide where each line goes (one reader), or
> 2) do multiple reads by specifying the partition number. Note that in this case you read the whole table n times (in parallel), which is more intensive on the read side (multiple readers).
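> 
> A minimal Spark sketch of option 2, assuming the Test_part table above and a placeholder urlWithUserAndPass; each reader recomputes NTILE over its own scan of the table:
> 
> import java.util.Properties
> 
> val numParts = 11
> // Tag every row with its NTILE bucket in a subquery, then hand Spark
> // one predicate per bucket; Spark opens one connection per predicate.
> val bucketed =
>   s"""(select t.*, ntile($numParts) over
>          (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
>        from Test_part t) tmp"""
> val ntilePredicates = (1 to numParts).map(i => s"nt = $i").toArray
> val parts = spark.read.jdbc(urlWithUserAndPass, bucketed, ntilePredicates, new Properties())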
> 
> Regards, 
> Igor
> 
> 
> 
> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
> Good points
> 
> Unfortunately Data Pump (expdp, impdp) uses a binary format for export and import that cannot be used to load data into HDFS in a suitable way.
> 
> One can use what is known as a flat-file shell script to get the data out tab- or comma-separated, etc.
> 
> ROWNUM is a pseudocolumn (not a real column) that is available in a query. The issue is that in a table of 280 million rows, getting the position of a row requires a table scan, since no index can be built on it (assuming there is no other suitable index). Not ideal, but it can be done.
> 
> I think a better alternative is to use Data Pump to take that table to DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a unique index on the sequence column, and do the partitioning there.
> 
> HTH
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
> 
> 
> On 10 September 2016 at 22:37, ayan guha <guha.ayan@gmail.com> wrote:
> In Oracle, something called ROWNUM is present in every row. You can create an even distribution using that column. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format.
> 
> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:
> Creating an Oracle sequence for a table of 200 million rows is not going to be easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST, and create the sequence there.
> 
> If it is a FACT table, then the foreign keys from the dimension tables will be bitmap indexes on the FACT table, so they can potentially be used.
> 
> HTH
> 
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> 
> 
> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
> Hi,
> 
> Yea, Spark does not have the same functionality as Sqoop.
> I think one simple solution is to assign unique ids to the Oracle table yourself.
> Thoughts?
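> 
> A minimal sketch of that idea, with hypothetical oracleUrl and BIG_TABLE names: derive an id from ROWNUM in a subquery and let Spark range-partition on it. Caveat: ROWNUM assignment is not guaranteed stable across the separate per-partition queries, so a persisted sequence column (discussed elsewhere in this thread) is safer.
> 
> import java.util.Properties
> 
> // Every per-partition query re-evaluates this subquery, so the derived
> // rid is only trustworthy on a static table or a staged copy.
> val withIds = "(select rownum as rid, t.* from BIG_TABLE t) tmp"
> val df = spark.read.jdbc(
>   oracleUrl, withIds,
>   columnName = "rid",
>   lowerBound = 1L, upperBound = 200800000L,
>   numPartitions = 104,
>   connectionProperties = new Properties())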
> 
> // maropu
> 
> 
> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> Strange that an Oracle table of 200 million plus rows has not been partitioned.
> 
> What matters here is to have parallel connections from JDBC to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle.
> 
> Surely among 404 columns there must be one with high cardinality to satisfy this work.
> 
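> A minimal sketch of such a parallel read, once a high-cardinality numeric column is identified (all names here are placeholders):
> 
> import java.util.Properties
> 
> // Spark splits [lowerBound, upperBound] into numPartitions ranges and
> // opens one JDBC connection per range; skew in the chosen column maps
> // directly to skew in task run times.
> val df = spark.read.jdbc(
>   oracleUrl, "BIG_TABLE",
>   columnName = "HIGH_CARD_COL",
>   lowerBound = 0L, upperBound = 200000000L,
>   numPartitions = 104,
>   connectionProperties = new Properties())
> 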
> Maybe you should just create table <small> as select * from Oracle_table where rownum <= 1000000; and use that for testing.
> 
> Another alternative is to use the Oracle SQL Connector for HDFS (https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125), which can do it for you. With 404 columns it is difficult to suggest any alternative. Is this a FACT table?
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> 
> 
> On 10 September 2016 at 16:20, Ajay Chander <itschevva@gmail.com> wrote:
> Hello Everyone,
> 
> My goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS.
> 
> Table in Oracle:
> 1) no primary key.
> 2) Has 404 columns.
> 3) Has 200,800,000 rows.
> 
> Spark SQL:
> In my Spark SQL I want to read the data into n number of partitions in parallel, for
which I need to provide 'partition column','lowerBound', 'upperbound', 'numPartitions' from
the table Oracle. My table in Oracle has no such column to satisfy this need(Highly Skewed),
because of it, if the numPartitions is set to 104, 102 tasks are finished in a minute, 1 task
finishes in 20 mins and the last one takes forever. 
> 
> Is there anything I could do to distribute the data evenly into partitions? Can we set any fake query to orchestrate this pull process, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
> 
> Any pointers are appreciated.
> 
> Thanks for your time.
> 
> ~ Ajay
> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> 

