spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: PySpark RDD.partitionBy() requires an RDD of tuples
Date Tue, 01 Apr 2014 23:38:46 GMT
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is "repartition()",
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
this behavior from reading the Python docs (though it is consistent with
the Scala API, just more type-safe there).
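
For what it's worth, here's a minimal sketch of the distinction, assuming a
running SparkContext sc and a build recent enough to have repartition() (the
partition counts in the comments are what I'd expect, not verified output):

# repartition() changes the partition count of any RDD, keyed or not
a = sc.parallelize(range(10), 2)
len(a.glom().collect())         # 2 partitions
b = a.repartition(5)
len(b.glom().collect())         # 5 partitions

# partitionBy() hash-partitions by key, so it needs (key, value) pairs
pairs = a.map(lambda x: (x, x))
c = pairs.partitionBy(5)
len(c.glom().collect())         # 5 partitions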


On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> Just an FYI, it's not obvious from the docs
> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
> that the following code should fail:
>
> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
> a._jrdd.splits().size()  # 2 partitions
> a.count()                # 10
> b = a.partitionBy(5)     # this is what fails: partitionBy() assumes (key, value) pairs
> b._jrdd.splits().size()
> b.count()                # the elements here are plain ints, not pairs
>
> I figured out from the example that if I generated a key by doing this
>
> b = a.map(lambda x: (x, x)).partitionBy(5)
>
> then all would be well.
>
> In other words, partitionBy() only works on RDDs of tuples. Is that
> correct?
>
> Nick
>
>
