spark-user mailing list archives

From Steve Lewis <>
Subject Why is this operation so expensive
Date Tue, 25 Nov 2014 18:06:50 GMT
I have a JavaPairRDD<KeyType, Tuple2<Type1, Type2>> originalPairs. There are
on the order of 100 million elements.

I call a function to re-key the tuples:

  JavaPairRDD<String, Tuple2<Type1, Type2>> newPairs =
      originalPairs.values().mapToPair(
          new PairFunction<Tuple2<Type1, Type2>, String, Tuple2<Type1, Type2>>() {
              public Tuple2<String, Tuple2<Type1, Type2>> call(Tuple2<Type1, Type2> t) {
                  return new Tuple2<String, Tuple2<Type1, Type2>>(t._1().getId(), t);
              }
          });
where Type1.getId() returns a String
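For reference, the same re-keying can be sketched with plain Java streams on stand-in types (Type1 and Type2 here are placeholder classes I made up for illustration; the real ones are the ~10 KB objects described below):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RekeySketch {
    // Stand-ins for the real Type1/Type2 (which are ~10 KB objects).
    record Type1(String id) { String getId() { return id; } }
    record Type2(String payload) {}

    // Re-key each (Type1, Type2) pair by Type1.getId(), as the mapToPair does.
    static Map<String, SimpleEntry<Type1, Type2>> rekey(
            List<SimpleEntry<Type1, Type2>> values) {
        return values.stream()
                .collect(Collectors.toMap(e -> e.getKey().getId(), e -> e));
    }

    public static void main(String[] args) {
        List<SimpleEntry<Type1, Type2>> values = List.of(
                new SimpleEntry<>(new Type1("a"), new Type2("x")),
                new SimpleEntry<>(new Type1("b"), new Type2("y")));
        // Keys are now the String ids of the Type1 objects.
        System.out.println(rekey(values).keySet());
    }
}
```

The transformation itself is a narrow map, so no shuffle should be needed; the question is why a per-element re-keying takes this long.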

The data are spread across 120 partitions on 15 machines. The operation is
dead simple, and yet it takes 5 minutes to generate the data and over 30
minutes to perform this simple operation. I am at a loss to understand
what is taking so long or how to make it faster. At this stage there is no
reason to move data to different partitions.
Anyone have bright ideas? Oh yes, Type1 and Type2 are moderately complex
objects, weighing in at about 10 KB each.
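A back-of-envelope estimate using the numbers in the post (100 million pairs, 120 partitions, ~10 KB per object; the 2x factor for counting both Type1 and Type2 in each pair is my assumption) suggests the raw data volume alone may explain the runtime:

```java
public class VolumeEstimate {
    static long totalBytes(long elements, long bytesPerPair) {
        return elements * bytesPerPair;
    }

    public static void main(String[] args) {
        long elements = 100_000_000L;        // ~100 million pairs
        long bytesPerPair = 2L * 10 * 1024;  // Type1 + Type2 at ~10 KB each (assumed)
        long total = totalBytes(elements, bytesPerPair);

        // Roughly 2e12 bytes touched just to read and re-emit every pair.
        System.out.printf("total ~ %.2f TB%n", total / Math.pow(1024, 4));

        // Spread over 120 partitions, each task handles on the order of 16 GB.
        System.out.printf("per partition ~ %.1f GB%n",
                total / 120.0 / Math.pow(1024, 3));
    }
}
```

If the RDD is not cached in deserialized form, each of those ~16 GB partitions has to be deserialized (and possibly re-read) to apply even a trivial map, which would dominate the cost of the re-keying itself.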
