spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Zipping RDDs of equal size not possible
Date Sat, 10 Jan 2015 05:56:13 GMT
"sample 2 * n tuples, split them into two parts, balance the sizes of
these parts by filtering some tuples out"

How do you guarantee that the two RDDs have the same size?

-Xiangrui

On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke
<1wilcke@informatik.uni-hamburg.de> wrote:
> Hi Spark community,
>
> I have a problem with zipping two RDDs of the same size and same number of
> partitions.
> The error message says that zipping is only allowed on RDDs which are
> partitioned into chunks of exactly the same sizes.
> How can I ensure this? My current workaround is to repartition both
> RDDs into a single partition, but that obviously does not scale.
>
> This problem originates from my attempt to draw n random tuple pairs (Tuple,
> Tuple) from an RDD[Tuple].
> What I do is sample 2 * n tuples, split them into two parts, balance the
> sizes of these parts by filtering some tuples out, and zip them together.
>
> I would appreciate hearing about better approaches to both problems.
>
> Thanks in advance,
> Niklas
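
To illustrate the question above: filtering can leave two RDDs with the same
total size while the individual partitions still differ in size, and that is
the case zip rejects. The following is only a rough sketch in plain Python,
not Spark code. It models an RDD as a list of partitions, and the helper
names (flatten, zip_partitions, zip_by_index) are hypothetical. The second
helper pairs elements by global position, which is roughly what combining
RDD.zipWithIndex with a join on the index would do in Spark.

```python
def flatten(partitions):
    """Return all elements of a partitioned collection in order."""
    return [x for part in partitions for x in part]


def zip_partitions(left, right):
    """Model the strict contract of zip: equal partition count and
    equal element count inside every pair of partitions."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for lpart, rpart in zip(left, right):
        if len(lpart) != len(rpart):
            raise ValueError("partitions differ in size")
        zipped.append(list(zip(lpart, rpart)))
    return zipped


def zip_by_index(left, right):
    """Pair elements by global position, ignoring partition layout.
    This stands in for indexing both sides and joining on the index."""
    left_by_index = dict(enumerate(flatten(left)))
    right_by_index = dict(enumerate(flatten(right)))
    shared = sorted(left_by_index.keys() & right_by_index.keys())
    return [(left_by_index[i], right_by_index[i]) for i in shared]


# Same total size (6) and same partition count (2), different layout.
a = [[1, 2, 3, 4], [5, 6]]
b = [["a", "b"], ["c", "d", "e", "f"]]

try:
    zip_partitions(a, b)
    strict_zip_failed = False
except ValueError:
    strict_zip_failed = True

pairs = zip_by_index(a, b)
```

In this model the strict zip fails even though both sides hold six elements,
while the index-based pairing succeeds. In Spark the join involves a shuffle,
so it costs more than zip, but it avoids collapsing everything into one
partition.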


