spark-user mailing list archives

From Onur EKİNCİ <oeki...@innova.com.tr>
Subject RE: Run jobs in parallel in standalone mode
Date Tue, 16 Jan 2018 13:01:05 GMT
Thank you Bill.

What about the number of ColumnProcessor.java:50 jobs?
How can we change their number, or does Spark configure it automatically? Does Spark extract
the data column by column?





Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>





From: Bill Schwanitz [mailto:bilsch@bilsch.org]
Sent: Tuesday, January 16, 2018 3:39 PM
To: Onur EKİNCİ <oekinci@innova.com.tr>
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads

I had the same issue with a different db, but it comes down to the JDBC read and task management.
You need to specify a partition column with upper and lower bounds. You also need to specify
how many partitions/threads to use (1 thread per worker).
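
A hedged sketch of what that looks like in spark-shell (the partition column name `id` and the bounds below are illustrative assumptions, not values from this thread):

```scala
// Sketch of a partitioned JDBC read. Without partitionColumn/lowerBound/
// upperBound/numPartitions, Spark issues a single query and reads the whole
// table in one task, which is why the job runs serially.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .option("partitionColumn", "id")  // assumed: a numeric/date/timestamp column
  .option("lowerBound", "1")        // assumed min of the partition column
  .option("upperBound", "1000000")  // assumed max of the partition column
  .option("numPartitions", "64")    // one task per partition; 64 matches spark.cores.max
  .load()
```

Spark splits the `id` range into 64 stride-based sub-queries, so each executor core can read its own slice concurrently.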

On Tue, Jan 16, 2018 at 3:00 AM, Onur EKİNCİ <oekinci@innova.com.tr<mailto:oekinci@innova.com.tr>>
wrote:
Hi,

We are trying to get data from an Oracle database into Kinetica database through Apache Spark.

We installed Spark in standalone mode and executed the following commands. We have tried
everything, but we couldn't manage to run jobs in parallel. We use 2 IBM servers, each
of which has 128 cores and 1 TB of memory.

We also added the following in spark-defaults.conf:
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partitions=32


On the machine: 10.20.10.228
./start-master.sh --webui-port 8585

./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine: 10.20.10.228:

We start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we run the load:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .load()

import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191",
  "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER",
  "muh_hareket_20", false, "", 100000, true, true,
  "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp)
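
As a sketch of a possible workaround (assuming the Kinetica loader simply consumes the DataFrame's partitions, which this thread does not confirm): if the JDBC read cannot be partitioned, repartitioning before the write at least spreads the load step across executors:

```scala
// Hedged sketch: force the DataFrame into 64 partitions so the write stage
// runs as 64 tasks instead of one. 64 is chosen to match spark.cores.max.
val dfPar = df.repartition(64)
println(dfPar.rdd.getNumPartitions)  // reports the new partition count
SparkKineticaLoader.KineticaWriter(dfPar, lp)
```

Note that the initial single-threaded JDBC read still happens before the repartition, so the partitioned read above is the better fix when a usable partition column exists.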


The above commands work successfully and the data transfer completes. However, the jobs run serially,
not in parallel. The executors also take turns rather than working in parallel.

How can we make jobs work in parallel?




I really appreciate your help. We have done everything that we could.



Onur EKİNCİ




Legal Notice:
This e-mail is subject to the Terms and Conditions document available at the following link:
http://www.innova.com.tr/disclaimer-yasal-uyari.asp
