spark-user mailing list archives

From Jason Nerothin
Subject Re: Question about relationship between number of files and initial tasks(partitions)
Date Thu, 04 Apr 2019 13:52:26 GMT
Have you tried something like this?

spark.conf.set("spark.sql.shuffle.partitions", "5")
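
For context on the original question: spark.sql.shuffle.partitions sets the partition count after a shuffle (joins, aggregations), so it may not change the first stage. For DataFrame file sources, the first-stage task count appears to come from a bin-packing step governed by spark.sql.files.maxPartitionBytes (default 128 MB), spark.sql.files.openCostInBytes (default 4 MB), and the default parallelism. Below is a rough pure-Python model of that step, assuming the logic in Spark 2.4's FileSourceScanExec and splittable files. estimate_read_partitions is a hypothetical helper written for illustration, not a Spark API, and readers built on the RDD API may behave differently.

```python
# Simplified model of how Spark SQL file sources size first-stage partitions.
# Assumption: this mirrors the bin-packing in Spark 2.4's FileSourceScanExec
# for splittable files. It is an illustration, not Spark code.

def estimate_read_partitions(file_sizes, default_parallelism,
                             max_partition_bytes=128 * 1024 * 1024,
                             open_cost_bytes=4 * 1024 * 1024):
    """Estimate the number of first-stage tasks for the given file sizes (bytes)."""
    # Every file is padded with the estimated cost of opening it.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))

    # Cut each file into chunks of at most max_split_bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split_bytes, size - offset))
            offset += max_split_bytes

    # Largest chunks first, then pack greedily into partitions.
    chunks.sort(reverse=True)
    partitions = 0
    current_size = 0
    current_count = 0
    for chunk in chunks:
        # Close the current partition once the next chunk would overflow it.
        if current_count > 0 and current_size + chunk > max_split_bytes:
            partitions += 1
            current_size = 0
            current_count = 0
        current_size += chunk + open_cost_bytes
        current_count += 1
    if current_count > 0:
        partitions += 1
    return partitions
```

For example, estimate_read_partitions([1024 * 1024] * 100, 8) models 100 files of 1 MB each on 8 cores; passing different max_partition_bytes or open_cost_bytes values shows how the corresponding settings would shift the count. If the estimate does not match what the Spark UI reports, calling coalesce(n) on the DataFrame right after the read is another way that should lower the first-stage partition count without merging files.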

On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:

> Hi Sparkers,
> I noticed that in my Spark application, the number of tasks in the first
> stage equals the number of files read by the application (at least for
> Avro) when the number of CPU cores is less than the number of files. When
> there are more CPU cores than files, however, it usually equals the default
> parallelism. Why does it behave like this? Would this require a lot of
> resources from the driver? Is there any way to decrease the number of
> tasks (partitions) in the first stage without merging the files before loading?
> Thanks,
> Arthur

