spark-user mailing list archives

From James Starks <suse...@protonmail.com.INVALID>
Subject Parallel read parquet file, write to postgresql
Date Mon, 03 Dec 2018 13:40:41 GMT
Reading the Spark docs (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), I
don't see how to read a Parquet file in parallel with SparkSession. Would --num-executors just
work? Do any additional parameters need to be set on the SparkSession as well?
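
For context, my read is essentially just this (the app name and path are placeholders); my
understanding is that Parquet is splittable, so the read itself should already be distributed:

                 import org.apache.spark.sql.SparkSession

                 val spark = SparkSession.builder().
                     appName("parquet-to-postgres").
                     getOrCreate()

                 // Parquet is splittable, so spark.read.parquet distributes the
                 // read across tasks (roughly one partition per file split).
                 val mydf = spark.read.parquet("/path/to/file.parquet")

                 // Check how many partitions the read actually produced.
                 println(mydf.rdd.getNumPartitions)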

Also, if I want to write data to the database in parallel, would the options 'numPartitions'
and 'batchsize' be enough to improve write performance? For example:

                 mydf.write.format("jdbc").
                     option("driver", "org.postgresql.Driver").
                     option("url", url).
                     option("dbtable", table_name).
                     option("user", username).
                     option("password", password).
                     option("numPartitions", N).
                     option("batchsize", M).
                     save()
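
Or, if the write parallelism follows the DataFrame's own partitioning and 'numPartitions' only
caps the number of concurrent JDBC connections, would an explicit repartition before the write
be the better approach? A sketch of what I mean (N and M are placeholders):

                 // Repartition so the write runs as N tasks, each opening
                 // its own JDBC connection to PostgreSQL.
                 mydf.repartition(N).
                     write.format("jdbc").
                     option("driver", "org.postgresql.Driver").
                     option("url", url).
                     option("dbtable", table_name).
                     option("user", username).
                     option("password", password).
                     option("batchsize", M).
                     save()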

On the Spark website (https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases),
those are the only two parameters I can find that would have an impact on DB write performance.

I appreciate any suggestions.