spark-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Parallel read parquet file, write to postgresql
Date Mon, 03 Dec 2018 15:06:51 GMT
Hi James.

--num-executors controls how many executors your application gets. The
number of tasks that can run at the same time is roughly the number of
executors multiplied by the cores per executor. How much of the reading and
writing actually happens in parallel is determined by data partitioning.
For a quick intro to how data partitioning works, see:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
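
To make the partitioning point concrete, here is a minimal Scala sketch of
checking how many partitions Spark chose for a Parquet read and changing
that number explicitly. The object and function names are made up for
illustration and are not part of Spark. As far as I know, the initial count
for file sources is driven mostly by file sizes and settings such as
spark.sql.files.maxPartitionBytes, so treat the numbers you see as
workload-specific.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object ParquetPartitions {

  // Reads a Parquet path and reports how many partitions Spark chose.
  // Each partition is processed by one task in a stage, so this count
  // bounds how many tasks can work on the read at once.
  def readWithPartitionCount(spark: SparkSession, path: String): (DataFrame, Int) = {
    val df = spark.read.parquet(path)
    (df, df.rdd.getNumPartitions)
  }

  // Explicitly changes the partition count. Note that repartition
  // triggers a shuffle, so it is not free.
  def widen(df: DataFrame, target: Int): DataFrame =
    df.repartition(target)
}
```

With an existing SparkSession, something like
`val (df, n) = ParquetPartitions.readWithPartitionCount(spark, path)` shows
what you got by default. If n is much smaller than the number of cores
available across your executors, the extra executors will mostly sit idle
during the read.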

You are right that numPartitions is the parameter that can be used to
control that, though in general Spark itself decides, given the data in
each stage, how to partition (i.e. how much to parallelize the read and
write of the data).
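
For the write side, a minimal sketch of how the two options fit together,
in Scala. The object and function names are hypothetical, and
SaveMode.Append is my assumption since your snippet does not set a save
mode. Per the JDBC data source docs, numPartitions caps the number of
concurrent JDBC connections (Spark coalesces the DataFrame down to that
number before writing if it has more partitions), and batchsize is the
number of rows sent per insert round trip.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper object, for illustration only.
object JdbcWriteSketch {

  // Collects the JDBC options in one place so they are easy to inspect.
  def jdbcOptions(url: String, table: String, user: String, password: String,
                  numPartitions: Int, batchSize: Int): Map[String, String] =
    Map(
      "driver"        -> "org.postgresql.Driver",
      "url"           -> url,
      "dbtable"       -> table,
      "user"          -> user,
      "password"      -> password,
      "numPartitions" -> numPartitions.toString, // max parallel connections
      "batchsize"     -> batchSize.toString      // rows per insert batch
    )

  // Writes the DataFrame over JDBC. Append mode is an assumption here;
  // choose the mode that matches how the target table should be treated.
  def write(df: DataFrame, opts: Map[String, String]): Unit =
    df.write.format("jdbc").options(opts).mode(SaveMode.Append).save()
}
```

Keep in mind that numPartitions only lowers the parallelism. If the
DataFrame has fewer partitions than that, you would need to repartition it
first to get more concurrent writers, and the database has to be able to
handle that many connections.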



On Mon, Dec 3, 2018 at 8:40 AM James Starks <suserft@protonmail.com.invalid>
wrote:

> I have been reading the Spark docs (
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). They
> do not mention how to read a parquet file in parallel with SparkSession.
> Would --num-executors just work? Do any additional parameters need to be
> added to SparkSession as well?
>
> Also, if I want to write data to a database in parallel, would the options
> 'numPartitions' and 'batchsize' be enough to improve write performance? For
> example,
>
>                  mydf.write.format("jdbc").
>                      option("driver", "org.postgresql.Driver").
>                      option("url", url).
>                      option("dbtable", table_name).
>                      option("user", username).
>                      option("password", password).
>                      option("numPartitions", N).
>                      option("batchsize", M).
>                      save()
>
> On the Spark website (
> https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases),
> I can only find these two parameters that would have an impact on db write
> performance.
>
> I appreciate any suggestions.
>
