spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Akhil Das <>
Subject Re: Shuffle produces one huge partition
Date Wed, 17 Jun 2015 07:43:21 GMT
Can you try repartitioning the rdd after creating the K,V. And also, while
calling the rdd1.join(rdd2, Pass the # partition argument too)

Best Regards

On Wed, Jun 17, 2015 at 12:15 PM, Al M <> wrote:

> I have 2 RDDs I want to Join.  We will call them RDD A and RDD B.  RDD A
> has
> 1 billion rows; RDD B has 100k rows.  I want to join them on a single key.
> 95% of the rows in RDD A have the same key to join with RDD B.  Before I
> can
> join the two RDDs, I must map them to tuples where the first element is the
> key and the second is the value.
> Since 95% of the rows in RDD A have the same key, they now go into the same
> partition.  When I perform the join, the system will try to execute this
> partition in just one task.  This one task will try to load too much data
> into memory at once and die a horrible death.
> I know that this is caused by the HashPartitioner that is used by default
> in
> Spark; everything with the same hashed key goes into the same partition.  I
> also tried the RangePartitioner but still saw 95% of the data go into the
> same partition.  What I'd really like is a partitioner that puts everything
> with the same key into the same partition *except* when the partition is
> over a certain size, then it would just spill into the next partition.
> Writing my own partitioner is a big job, and requires a lot of testing to
> make sure I get it right.  Is there a simpler way to solve this?
> --
> View this message in context:
> Sent from the Apache Spark User List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View raw message