spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject repartitionByRange and number of tasks
Date Tue, 30 Jul 2019 01:05:35 GMT
Hi,

*Hardware and Spark Details:*
* Spark 2.4.3
* 30-node EMR cluster, with each executor having 4 cores and 15 GB RAM. At
100% allocation, 4 executors run on each node.

*Question:*
When I execute the following code, around 60 partitions are written out
using only 20 tasks running in parallel. It is late at night here and I am
failing to understand what needs to be done to increase the number of tasks
while writing; the cluster can clearly write with more than 20 tasks in
parallel.
The reads of around 50,000 historical files happen in parallel, but the
issue is with the write, which uses only 20 tasks for around 60 partitions.

*Code:*
spark.read.csv("s3://bucket_name/file_initials_*.gz") \
    .repartitionByRange(60, "partition_field") \
    .write \
    .partitionBy("partition_field") \
    .parquet("s3://bucket_name/key/")
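A possible explanation, offered here only as an assumption rather than a
confirmed diagnosis: if partition_field has only about 20 distinct values,
repartitionByRange cannot split rows that share a value across range
partitions, so the number of non-empty partitions, and hence write tasks,
is capped at the number of distinct values regardless of the 60 requested.
Adding a small "salt" column and repartitioning on (partition_field, salt)
would raise that ceiling. The counting argument, in plain Python (all names
and cardinalities illustrative):

```python
# Hypothetical sketch, not from the original message: with 20 distinct
# values of partition_field, keying the shuffle on that column alone
# yields at most 20 non-empty partitions; keying on (value, salt) with
# 3 salt buckets allows up to 20 * 3 = 60 parallel tasks.

NUM_VALUES = 20    # assumed distinct values of partition_field
SALT_BUCKETS = 3   # illustrative salt cardinality

rows = [("p%02d" % (i % NUM_VALUES), i % SALT_BUCKETS) for i in range(1000)]

keys_without_salt = {value for value, _ in rows}
keys_with_salt = {(value, salt) for value, salt in rows}

print(len(keys_without_salt))  # 20 -> at most 20 parallel write tasks
print(len(keys_with_salt))     # 60 -> up to 60 parallel write tasks
```

In Spark terms this would correspond to something like adding a salt column
and calling .repartitionByRange(60, "partition_field", "salt"), but whether
the distinct-value count is really the bottleneck here is an untested
assumption about the workload.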


Thanks and Regards,
Gourav Sengupta
