spark-user mailing list archives

From <em...@yeikel.com>
Subject RE: Question about relationship between number of files and initial tasks(partitions)
Date Sat, 13 Apr 2019 18:35:58 GMT
Before we conclude that the issue is skewed data, let's confirm it:

 

import org.apache.spark.sql.functions.spark_partition_id

 

df.groupBy(spark_partition_id()).count().show()

 

This should give the number of records you have in each partition. 
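In plain terms, that check is just a record count per partition id; a large spread between the largest and smallest counts signals skew. Here is a quick plain-Python simulation of what a skewed result looks like (illustrative only, not Spark code; the partition ids and counts are made up):

```python
from collections import Counter

# Simulated partition id for each record; partition 2 holds most of the data.
partition_ids = [2] * 900 + [0] * 40 + [1] * 35 + [3] * 25

counts = Counter(partition_ids)
for pid, n in sorted(counts.items()):
    print(pid, n)

# Ratio of the biggest to the smallest partition; far above 1 means skew.
print(max(counts.values()) / min(counts.values()))   # 36.0
```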

 

 

From: Sagar Grover <sagargrover16@gmail.com> 
Sent: Thursday, April 11, 2019 8:23 AM
To: yeikel valdes <email@yeikel.com>
Cc: jasonnerothin@gmail.com; arthur.li@flipp.com; user@spark.apache.org
Subject: Re: Question about relationship between number of files and initial tasks(partitions)

 

Extending Arthur's question,

I am facing the same problem (the number of partitions was huge: 960 cores, 16,000 partitions). I
tried to decrease the number of partitions with coalesce, but the problem is unbalanced data.
After using coalesce, I get a Java out-of-heap-space error; there was no such error
without coalesce. I am guessing the error is due to uneven data, with some heavy partitions
getting merged together.

Let me know if you have any pointers on how to handle this.
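That guess is plausible: coalesce reduces the partition count by merging existing partitions without a shuffle, so with skewed data a heavy partition and its neighbors can end up fused into one oversized partition, which fits a heap error that appears only after coalesce. A rough plain-Python sketch of that merging behavior (the contiguous grouping here mirrors the idea, not Spark's exact assignment logic; the sizes are made up):

```python
# Simulate how coalesce merges existing partitions without a shuffle:
# parent partitions are assigned to target partitions in contiguous groups.

def coalesce_sizes(partition_sizes, num_target):
    """Merge consecutive partitions into num_target groups and
    return the resulting partition sizes."""
    n = len(partition_sizes)
    merged = [0] * num_target
    for i, size in enumerate(partition_sizes):
        merged[i * num_target // n] += size
    return merged

# 8 skewed partitions: one is far larger than the rest.
sizes = [10, 10, 1000, 10, 10, 10, 10, 10]

merged = coalesce_sizes(sizes, 2)
print(merged)                    # [1030, 40] -- the heavy partition dominates
print(max(sizes), max(merged))   # the biggest partition only got bigger
```

Because of this, repartition(n), which does a full shuffle and rebalances records across partitions, is often the safer choice than coalesce when the data is skewed, at the cost of the shuffle.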

 

On Wed, Apr 10, 2019 at 11:21 PM yeikel valdes <email@yeikel.com> wrote:

If you need to reduce the number of partitions, you could also try

df.coalesce(numPartitions)


---- On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerothin@gmail.com wrote ----

Have you tried something like this?

 

spark.conf.set("spark.sql.shuffle.partitions", "5")

 

 

 

On Wed, Apr 3, 2019 at 8:37 PM Arthur Li <arthur.li@flipp.com> wrote:

Hi Sparkers,

 

I noticed that in my Spark application, the number of tasks in the first stage is equal to
the number of files read by the application (at least for Avro) if the number of CPU cores
is less than the number of files. But if there are more CPU cores than files, the number of
tasks is usually equal to the default parallelism. Why does it behave like this? Would this
require a lot of resources from the driver? Is there anything we can do to decrease the number
of tasks (partitions) in the first stage without merging files before loading?
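This behavior is consistent with how Spark's file-based data sources size their read partitions: Spark computes a target split size from the total input size, the default parallelism, and two configs (spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes), then packs files into partitions of at most that size. A simplified plain-Python sketch of the heuristic (the real logic in Spark's FilePartition also splits large splittable files; this version only packs whole files, and the example sizes are made up):

```python
# Rough sketch of how Spark's DataSource file scans size their input
# partitions. Constants mirror the defaults of
# spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.

MAX_PARTITION_BYTES = 128 * 1024 * 1024
OPEN_COST_IN_BYTES = 4 * 1024 * 1024

def max_split_bytes(total_bytes, default_parallelism):
    bytes_per_core = total_bytes // default_parallelism
    return min(MAX_PARTITION_BYTES, max(OPEN_COST_IN_BYTES, bytes_per_core))

def num_partitions(file_sizes, default_parallelism):
    """Greedily pack files into partitions of at most max_split_bytes,
    charging an open cost per file (simplified from Spark's packing)."""
    target = max_split_bytes(sum(file_sizes), default_parallelism)
    partitions, current = 0, 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + OPEN_COST_IN_BYTES
        if current + cost > target and current > 0:
            partitions += 1
            current = 0
        current += cost
    return partitions + (1 if current > 0 else 0)

# 1000 files of 64 MB read with default parallelism 960: the target split
# comes out to ~66 MB, so no two files fit together and each file gets
# its own partition -- one task per file.
print(num_partitions([64 * 1024 * 1024] * 1000, 960))   # 1000

# 100 files of 1 MB with parallelism 4: the ~25 MB target packs several
# small files per partition, so there are far fewer tasks than files.
print(num_partitions([1024 * 1024] * 100, 4))           # 20
```

So when files are large relative to the target split size, tasks track the file count; tuning spark.sql.files.maxPartitionBytes upward is one way to get fewer, larger read partitions without merging files beforehand.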

 

Thanks,

Arthur 

 


IMPORTANT NOTICE:  This message, including any attachments (hereinafter collectively referred
to as "Communication"), is intended only for the addressee(s) named above.  This Communication
may include information that is privileged, confidential and exempt from disclosure under
applicable law.  If the recipient of this Communication is not the intended recipient, or
the employee or agent responsible for delivering this Communication to the intended recipient,
you are notified that any dissemination, distribution or copying of this Communication is
strictly prohibited.  If you have received this Communication in error, please notify the
sender immediately by phone or email and permanently delete this Communication from your computer
without making a copy. Thank you.

 

 

-- 

Thanks,

Jason

 

