spark-user mailing list archives

From Steve Lewis <>
Subject Distributing Computation across slaves
Date Thu, 15 Jan 2015 21:15:20 GMT
I have a job involving two sets of data indexed by the same type of key.
I have an expensive operation that I want to run on pairs sharing the same
key. The following code works, BUT all of the work is being done on 3 of 16
processors.
How do I go about diagnosing and fixing this behavior? A shuffle would
take a lot less time than running MyExpensiveOperation on all the data.

JavaPairRDD<MyKey,Type1> set1;
JavaPairRDD<MyKey,Type2> set2;

I do a join:

JavaPairRDD<MyKey,Tuple2<Type1,Type2>> joinSet = set1.join(set2);

JavaRDD<MyResult> results = joinSet.values().map(new
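
One way to start diagnosing this, sketched in plain Java without Spark: assuming the join uses hash partitioning, each key is placed in partition (non-negative hashCode modulo number of partitions), so a key type whose hashCode takes only a few distinct values can fill only a few partitions. The class PartitionSkewCheck, its methods, and CoarseKey below are hypothetical names for illustration; CoarseKey is not MyKey, and a coarse hashCode is only one possible cause of the 3-of-16 behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSkewCheck {
    // Hypothetical key whose hashCode takes only three distinct values.
    static final class CoarseKey {
        final int id;
        CoarseKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof CoarseKey && ((CoarseKey) o).id == id;
        }
        @Override public int hashCode() { return id % 3; }
    }

    // Placement rule assumed here: non-negative hashCode modulo partition count.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Count how many keys land in each partition.
    static int[] countsPerPartition(List<?> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (Object key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    // Number of partitions that received at least one key.
    static int nonEmpty(int[] counts) {
        int n = 0;
        for (int c : counts) {
            if (c > 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Integer> spreadKeys = new ArrayList<>();
        List<CoarseKey> coarseKeys = new ArrayList<>();
        for (int i = 0; i < 1600; i++) {
            spreadKeys.add(i);
            coarseKeys.add(new CoarseKey(i));
        }
        // Print the per-partition key counts for 16 partitions.
        System.out.println(Arrays.toString(countsPerPartition(spreadKeys, 16)));
        System.out.println(Arrays.toString(countsPerPartition(coarseKeys, 16)));
    }
}
```

If a check like this on a sample of the real keys shows most partitions empty, the hashCode of the key type is a likely suspect. If the keys spread evenly, it may be worth looking instead at how many partitions joinSet actually has, and at repartitioning the joined data before the expensive map.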
