spark-user mailing list archives

From ayan guha <>
Subject Re: Partition Case Class RDD without PairRDDFunctions
Date Wed, 06 May 2015 10:09:31 GMT
What does your MyClass look like? I was experimenting with the Row class in
Python, and apparently partitionBy automatically takes the first column as
the key. However, I am not sure how you could access part of an object
without deserializing it (either explicitly or with Spark doing it for you).
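
To make the point about keys concrete, here is a minimal plain-Python sketch
of what hash partitioning has to do for each record. It is only a simulation
on a list, not the Spark API: the MyClass record type, its fields, and the
hash_partition helper are all hypothetical names made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type standing in for MyClass from the question.
MyClass = namedtuple("MyClass", ["id", "name"])


def hash_partition(records, num_partitions, key_func=hash):
    """Simulate keying, hash partitioning, and dropping the key again.

    Plain-Python illustration only; this is not how Spark is invoked.
    """
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # The key has to be computed from the live (deserialized) object.
        key = key_func(record)
        # Python's % is non-negative for a positive divisor, even if key < 0.
        index = key % num_partitions
        # Only the record itself is stored; the key is thrown away.
        partitions[index].append(record)
    return partitions


records = [MyClass(i, "item%d" % i) for i in range(10)]

# Partition by the hash of the whole object, as asked in the question.
parts = hash_partition(records, 4)

# Partition by a single field instead, which gives a predictable layout.
by_id = hash_partition(records, 4, key_func=lambda r: r.id)
```

In this sketch the key exists only for the duration of one loop iteration, so
the extra cost is computing the hash rather than storing a second copy of the
object. In PySpark the analogous chain would be roughly
rdd.keyBy(f).partitionBy(n).values(), though whether the intermediate tuples
matter for performance in a real job is something to measure.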

On Wed, May 6, 2015 at 7:14 PM, Night Wolf <> wrote:

> Hi,
> If I have an RDD[MyClass] and I want to partition it by the hash code of
> MyClass for performance reasons, is there any way to do this without
> converting it into a pair RDD, RDD[(K, V)], and calling partitionBy?
> Mapping it to a Tuple2 seems like a waste of space/computation.
> It looks like PairRDDFunctions.partitionBy() uses a ShuffledRDD[K, V, C],
> which requires K, V, and C. Could I create a new
> ShuffledRDD[MyClass, MyClass, MyClass](caseClassRdd, new HashPartitioner)?
> Cheers,
> N

Best Regards,
Ayan Guha
