spark-user mailing list archives

From Daniel Darabos <daniel.dara...@lynxanalytics.com>
Subject Re: Efficient self-joins
Date Mon, 08 Dec 2014 14:47:18 GMT
Could you not use a groupByKey instead of the join? I mean something like
this:

val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) } // key edges by destination
byDst.groupByKey.map { case (dst, edges) =>
  // edges now holds every incoming (src, w) for this dst; iterating it twice
  // visits all pairs of incoming edges.
  for {
    (src1, w1) <- edges
    (src2, w2) <- edges
  } {
    ??? // Do something with each pair.
  }
  ??? // Return something.
}
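
To make the placeholders concrete, here is a minimal, self-contained sketch of the same idea. The sample data, the pair-level value (multiplying the two weights), and names like allIncomingPairs are purely hypothetical, just to show the shape of the transformation:

import org.apache.spark.{SparkConf, SparkContext}

object IncomingEdgePairs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("incoming-edge-pairs").setMaster("local[*]"))

    // Hypothetical edge list: (srcID, dstID, weight).
    val edges = sc.parallelize(Seq(
      (1L, 3L, 0.5), (2L, 3L, 1.0), (4L, 3L, 2.0), (1L, 2L, 0.1)))

    // Key each edge by its destination, then group the incoming edges per destination.
    val byDst = edges.map { case (src, dst, w) => dst -> (src, w) }
    val allIncomingPairs = byDst.groupByKey.flatMap { case (dst, incoming) =>
      // Iterating over the grouped edges twice yields every pair of incoming edges.
      for {
        (src1, w1) <- incoming
        (src2, w2) <- incoming
        if src1 != src2 // skip pairing an edge with itself
      } yield (dst, (src1, src2), w1 * w2) // hypothetical pair-level value
    }

    allIncomingPairs.collect().foreach(println)
    sc.stop()
  }
}

One caveat with the groupByKey approach: all incoming edges of a single destination are materialized together on one executor, so a destination with a very high in-degree can become a memory and skew hotspot.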

On Mon, Dec 8, 2014 at 3:28 PM, Koert Kuipers <koert@tresata.com> wrote:

> Spark can do efficient joins if both RDDs have the same partitioner, so in
> the case of a self-join I would recommend creating an RDD that has an
> explicit partitioner and has been cached.
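
A rough sketch of that pre-partitioned, cached self-join, reusing the edges RDD from the sketch above; the partition count of 100 is only illustrative (it matches the Learning Spark example quoted further down):

import org.apache.spark.HashPartitioner

// Key the edges by destination, give the RDD an explicit partitioner, and cache it.
val byDstPartitioned = edges
  .map { case (src, dst, w) => dst -> (src, w) }
  .partitionBy(new HashPartitioner(100)) // illustrative partition count
  .cache()

// Because both sides of the join share the same partitioner (they are the
// same cached RDD), the join can reuse the existing partitioning instead of
// shuffling both sides.
val incomingPairs = byDstPartitioned.join(byDstPartitioned)
// incomingPairs: RDD[(dstID, ((src1, w1), (src2, w2)))]
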
> On Dec 8, 2014 8:52 AM, "Theodore Vasiloudis" <
> theodoros.vasiloudis@gmail.com> wrote:
>
>> Hello all,
>>
>> I am working on a graph problem using vanilla Spark (not GraphX), and at
>> some point I would like to do a self-join on an edges RDD[(srcID, dstID, w)]
>> on the dst key, in order to get all pairs of incoming edges.
>>
>> Since this is the performance bottleneck for my code, I was wondering if
>> there are any steps I can take before performing the self-join in order to
>> make it as efficient as possible.
>>
>> In the Learning Spark book
>> (https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html),
>> for example, in the "Data partitioning" section they recommend performing
>> .partitionBy(new HashPartitioner(100)) on an RDD before joining it with
>> another.
>>
>> Are there any guidelines for optimizing self-join performance?
>>
>> Regards,
>> Theodore
>>
