spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject PySpark RDD.partitionBy() requires an RDD of tuples
Date Tue, 01 Apr 2014 22:01:07 GMT
Just an FYI, it's not obvious from the docs
(http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy)
that the following code should fail:

a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)   # plain RDD of ints, no keys
a._jrdd.splits().size()                          # number of partitions: 2
a.count()                                        # 10
b = a.partitionBy(5)                             # elements are not (key, value) pairs
b._jrdd.splits().size()
b.count()

I figured out from the example that if I generated a key by doing this:

b = a.map(lambda x: (x, x)).partitionBy(5)

then all would be well.

In other words, partitionBy() only works on RDDs of tuples. Is that correct?
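
For what it's worth, a quick way to confirm that the keyed version really lands in 5 partitions (just a sketch, assuming the usual sc from the pyspark shell) is glom(), which collects each partition into its own list:

a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
b = a.map(lambda x: (x, x)).partitionBy(5)
len(b.glom().collect())   # 5, one list per partition
b.keys().collect()        # the original values, now carried as keys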

Nick
