spark-user mailing list archives

From: Artur R
Subject: How to redistribute dataset without full shuffle
Date: Fri, 17 Mar 2017 21:52:01 GMT

I use Spark heavily for various workloads and keep running into the same
situation: there is some skewed dataset (without any partitioner assigned)
and I just want to "redistribute" its data more evenly.

For example, say there is an RDD of X partitions with Y rows in each, except
for one large partition with Y * 10 rows. I don't want to change the number
of partitions, only redistribute the rows. Obviously, such an operation
should not need to send more than ~Y * 9 rows across the network.
But the only option available is repartition, which requires a full shuffle
that moves ALL (roughly X * Y) rows.
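
To make the arithmetic above concrete, here is a small plain-Python sketch
(no Spark involved). It estimates how many rows would have to leave the
over-full partitions to even out the sizes, and compares that with the total
row count that a full shuffle has to process. The function name rows_to_move
and the sample values X = 10, Y = 1000 are assumptions chosen for
illustration; this is not an existing Spark API.

```python
def rows_to_move(sizes):
    """Rows that must leave over-full partitions so that every
    partition ends up within one row of the mean size."""
    total = sum(sizes)
    n = len(sizes)
    base, extra = divmod(total, n)
    # `extra` partitions get base + 1 rows, the rest get base.
    # Giving the larger targets to the largest partitions keeps movement low.
    order = sorted(range(n), key=lambda i: sizes[i], reverse=True)
    moved = 0
    for rank, i in enumerate(order):
        target = base + 1 if rank < extra else base
        moved += max(0, sizes[i] - target)
    return moved


# Hypothetical example: X = 10 partitions, Y = 1000 rows, one partition 10x larger.
X, Y = 10, 1000
sizes = [Y] * (X - 1) + [10 * Y]

print("rows in dataset (input to a full shuffle):", sum(sizes))  # 19000
print("rows an ideal rebalance would move:", rows_to_move(sizes))  # 8100
```

With these numbers the ideal rebalance moves 9 * Y * (X - 1) / X = 8100 rows,
which stays under the ~Y * 9 bound mentioned above, while a full shuffle has
all 19000 rows as its input.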

The question: why is there no operation like "redistribute"?
