spark-user mailing list archives

From Tomasz Dudek <megatrontomaszdu...@gmail.com>
Subject Re: Question on using pseudo columns in spark jdbc options
Date Thu, 07 Dec 2017 19:20:00 GMT
Hey Ravion,

yes, you can specify a column other than the primary key. Be aware,
though, that if the key values are not spread evenly (for example, in your
code, if there is a "gap" in the primary keys and no row has an id between 0
and 17220), some of the executors may not help load the data (because
"SELECT * FROM orders WHERE order_id BETWEEN 0 AND 17220" will return an
empty result). I think you might want to repartition afterwards to ensure
that the df is evenly distributed (<--- could somebody confirm my last
sentence? I don't want to mislead and I am not sure).
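
To make the range splitting more concrete, here is a rough sketch of how the per-partition WHERE clauses follow from lowerBound, upperBound and numPartitions. This is a simplified illustration based on my understanding of Spark 2.x, not Spark's actual source; the object PartitionSketch and the helper partitionPredicates are hypothetical names I made up for this example.

```scala
// Simplified sketch (hypothetical helper, NOT Spark's actual source code) of
// how the bounds and numPartitions turn into one WHERE clause per partition.
object PartitionSketch {
  def partitionPredicates(
      column: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch only covers 2 or more partitions")
    require(lowerBound < upperBound, "lowerBound must be below upperBound")

    // Width of each range, using integer division (assumed to match Spark 2.x).
    val stride = upperBound / numPartitions - lowerBound / numPartitions

    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0) s"$column < $hi OR $column IS NULL" // open-ended below
      else if (i == numPartitions - 1) s"$column >= $lo" // open-ended above
      else s"$column >= $lo AND $column < $hi"
    }
  }
}
```

With the values from your code, PartitionSketch.partitionPredicates("order_id", 1L, 68883L, 4) gives four clauses, the first one being "order_id < 17221 OR order_id IS NULL". If no rows fall into one of those ranges, the task for that partition simply comes back empty. Calling df.repartition(4) after load() should spread the rows evenly again, at the cost of a shuffle.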

As for your first question (pseudo columns), could you just try it and let
us know the result? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <
ravishankar.nair@gmail.com>:

> Hi all,
>
> I am using a query to fetch data from MYSQL as follows:
>
> var df = spark.read.
> format("jdbc").
> option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
> option("driver" ,"com.mysql.jdbc.Driver").
> option("user", "retail_dba").
> option("password", "cloudera").
> option("dbtable", "orders").
> option("partitionColumn", "order_id").
> option("lowerBound", "1").
> option("upperBound", "68883").
> option("numPartitions", "4").
> load()
>
> Question is, can I use a pseudo column (like ROWNUM in Oracle or
> RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ?
> If not, can we specify a partition column which is not a primary key ?
>
> Best,
> Ravion
