spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: writing into oracle database is very slow
Date Fri, 19 Apr 2019 21:36:29 GMT
I don't believe there is anything inherently slow about writing to Oracle.

This piece of code works fine for me, reading a CSV file from HDFS and writing
it to an Oracle table:

import java.sql.DriverManager
import java.sql.Connection
import java.sql.DatabaseMetaData
import java.sql.ResultSet
import java.sql.SQLException
import java.util.ArrayList
import org.apache.spark.sql.functions._
import java.sql.{Date, Timestamp}
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val driverName = "oracle.jdbc.OracleDriver"
var url= "jdbc:oracle:thin:@rhes564:1521:mydb12"
var _username = "scratchpad"
var _password = "oracle"
var _dbschema = "SCRATCHPAD"
var _dbtable = "ANALYTICS"
var e:SQLException = null
var connection:Connection = null
var metadata:DatabaseMetaData = null
println("\nStarted at")
spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
//
// Get a DataFrame first, reading the CSV with its header row so the heading
// line is not ingested as data (note one of the columns is called "Type")
//
val df = spark.read.option("header", true).csv("hdfs://rhes75:9000/data/stg/domains/analytics")
//
// Map the columns to names
//
val a = df.select(
  col("Vendor").as("VENDOR"),
  col("product_name").as("PRODUCT_NAME"),
  col("version").as("VERSION"),
  col("release").as("RELEASE"),
  col("edition").as("EDITION"),
  col("service_pack").as("SERVICE_PACK"),
  col("MPLID").as("MPLID"),
  col("Product Impact").as("PRODUCTIMPACT"),
  col("Hardware / Software / OS").as("HARDWARESOFTWAREOS"),
  col("Software Design").as("SOFTWAREDESIGN"),
  col("Data Processing").as("DATAPROCESSING"),
  col("Big Data").as("BIGDATA"),
  col("Distributed Ledgers").as("DISTRIBUTEDLEDGERS"),
  col("Analytics").as("ANALYTICS"),
  col("Distributed Systems").as("DISTRIBUTEDSYSTEMS"),
  col("API").as("API"),
  col("Compute").as("COMPUTE"),
  col("Desktop").as("DESKTOP"),
  col("UI").as("UI"),
  col("Process Automation").as("PROCESSAUTOMATION"),
  col("Security").as("SECURITY"),
  col("DevOps").as("DEVOPS"),
  col("Comments").as("COMMENTS"))
// Define prop
val prop = new java.util.Properties
prop.setProperty("user", _username)
prop.setProperty("password",_password)
//
// populate Oracle table
//
// Check Oracle is accessible
try {
  connection = DriverManager.getConnection(url, _username, _password)
} catch {
  case e: SQLException =>
    // No connection was established, so there is nothing to close
    e.printStackTrace
    sys.exit(1)
}
metadata = connection.getMetaData()
// Check table exists
var rs:ResultSet = metadata.getTables(null,_dbschema,_dbtable, null)
if (rs.next()) {
   println("Table " + _dbschema+"."+_dbtable + " exists")
} else {
   println("Table " + _dbschema+"."+_dbtable + " does not exist, quitting!")
   connection.close()
   sys.exit(1)
}
a.write.mode("overwrite").jdbc(url, _dbschema+"."+_dbtable, prop)
//
// Test all went OK by looking at some old transactions
//
val s = HiveContext.read.format("jdbc").options(
       Map("url" -> url,
       "dbtable" -> _dbtable,
       "user" -> _username,
       "password" -> _password)).load
s.orderBy('VENDOR,'PRODUCT_NAME).collect.foreach(println)
//
println("\nFinished at")
spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
sys.exit()
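
If the write itself is the slow part, the JDBC writer options below are worth
experimenting with. This is a minimal sketch reusing the variables above; the
numbers are illustrative starting points, not recommendations:

// batchsize defaults to 1000 rows per JDBC round trip; numPartitions caps
// the number of concurrent JDBC connections used for the write; truncate
// keeps the existing table on overwrite instead of dropping and recreating it
a.write
  .mode("overwrite")
  .option("truncate", "true")
  .option("numPartitions", "8")
  .option("batchsize", "10000")
  .jdbc(url, _dbschema+"."+_dbtable, prop)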

I have uploaded the Scala code, the example CSV file (the source of the data)
and the DDL to create the Oracle table. They only need minor changes for the
schema name and for the Oracle host and port where the listener is running.

You ought to check where the bottleneck is by looking at Oracle performance
as well. You can use wait_detection.sql to see what is taking the resources
and causing waits, or ask your DBA to do it for you.
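
If you want a quick look from the Spark side, something along these lines
pulls the current session waits over the same JDBC route (a sketch only: the
v$session columns vary across Oracle versions and the account needs SELECT
privilege on the view):

val waits = spark.read.format("jdbc").options(
       Map("url" -> url,
       "dbtable" -> "(SELECT event, state, seconds_in_wait FROM v$session WHERE username = 'SCRATCHPAD') w",
       "user" -> _username,
       "password" -> _password)).load
waits.show(false)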

HTH

Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 19 Apr 2019 at 19:43, Lian Jiang <jiangok2006@gmail.com> wrote:

> Thanks for the interesting ideas! It looks like Spark writing directly to a
> relational database is not as straightforward as I expected.
>
> Sent from my iPhone
>
> On Apr 19, 2019, at 06:58, Khare, Ankit <ankit.khare@eon.com> wrote:
>
> Hi Jiang
>
> We faced a similar issue, so we write the file out first and then use Sqoop
> to export the data to MSSQL.
>
> We achieved a great time benefit with this strategy.
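>
> For illustration, the Spark side of that strategy might look like this (a
> rough sketch; the path, delimiter and Sqoop connection details are all
> placeholders, and df stands for the DataFrame being exported):
>
> // land the DataFrame as tab-delimited text on HDFS
> df.write.mode("overwrite").option("sep", "\t").csv("hdfs:///staging/mssql_export")
> // then export it with Sqoop, along the lines of:
> //   sqoop export --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
> //     --username me --password-file /user/me/.pw --table MY_TABLE \
> //     --export-dir /staging/mssql_export --input-fields-terminated-by '\t'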
>
> Sent from my iPhone
>
> On 19. Apr 2019, at 10:47, spark receiver <spark.receiver@gmail.com>
> wrote:
>
> Hi Jiang,
>
> I was facing the very same issue; the solution is to write to a file and use
> an Oracle external table to do the insert.
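>
> Roughly, and only as a sketch (the stage path and table names are
> placeholders, and the flat file has to sit in a directory the database
> server itself can read):
>
> // Spark side: land the data as a single CSV file for the external table
> df.coalesce(1).write.mode("overwrite").csv("/u01/app/stage/analytics_load")
> // Oracle side, sketched: create an external table over that file using a
> // DIRECTORY object (ORGANIZATION EXTERNAL, TYPE ORACLE_LOADER), then
> //   INSERT /*+ APPEND */ INTO analytics SELECT * FROM analytics_ext;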
>
> Hope this helps.
>
> Dalin
>
> On Thu, Apr 18, 2019 at 11:43 AM Jörn Franke <jornfranke@gmail.com> wrote:
>
>> What is the size of the data? How much time does it need on HDFS and how
>> much on Oracle? How many partitions do you have on Oracle side?
>>
>> On 06.04.2019, at 16:59, Lian Jiang <jiangok2006@gmail.com> wrote:
>>
>> Hi,
>>
>> My spark job writes into oracle db using:
>>
>> df.coalesce(10).write.format("jdbc")
>>   .option("url", url)
>>   .option("driver", driver)
>>   .option("user", user)
>>   .option("batchsize", 2000)
>>   .option("password", password)
>>   .option("dbtable", tableName)
>>   .mode("append").save()
>>
>> It is much slower than writing into HDFS. The data to write is small.
>>
>> Is this expected? Thanks for any clue.
>>
>>
>>
