spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From losmi83 <>
Subject Partitioning under Spark 1.0.x
Date Tue, 19 Aug 2014 10:53:04 GMT
Hi guys,

I want to create two RDD[(K, V)] objects and then collocate partitions with
the same K on one node. 
When the same partitioner for two RDDs is used, partitions with the same K
end up being on different nodes.
Here is a small example that illustrates this:

    // Let's say I have 10 nodes
    val partitioner = new HashPartitioner(10)     
    // Create RDD
    val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))
    // Partition twice using the same partitioner
    rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy1 ->
k = " + k) } 
    rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy2 ->
k = " + k) } 

The output on one node is: 
    Dummy1 -> k = 2 
    Dummy2 -> k = 7 

I was expecting to see the same keys on each node. That was happening under
Spark 0.9.2, but not under Spark 1.0.x.

Anyone has an idea what has changed in the meantime? Or how to get
corresponding partitions on one node?


View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message