spark-user mailing list archives

From Eric Ho <e...@analyticsmd.com>
Subject Re: How to do nested for-each loops across RDDs ?
Date Mon, 15 Aug 2016 20:29:53 GMT
Thanks Daniel.
Do you have any code fragments on using CoGroups or Joins across 2 RDDs ?
I don't think an index would help much, because this is an N x M
operation that examines each element of each RDD.  Each comparison is complex,
as it needs to peer into a complex JSON structure.
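
For an N x M comparison like this, one option is `cartesian` followed by `filter`. The following is only a minimal PySpark sketch, assuming each element is a (JSON string, number) pair. The `matches` predicate and the "category" field are hypothetical stand-ins for the real comparison, and `local_matching_bs` is just the original nested loop in plain Python for reference.

```python
import json


def matches(a, b):
    """Hypothetical predicate; a and b are (json_string, number) pairs.

    The "category" field and the numeric test are assumptions made for
    illustration only. Replace them with the real comparison.
    """
    doc_a = json.loads(a[0])
    doc_b = json.loads(b[0])
    return doc_a.get("category") == doc_b.get("category") and b[1] >= a[1]


def matching_bs(rdd_a, rdd_b):
    """Spark version: keep every b that matches some a.

    cartesian() builds all N x M (a, b) combinations, so this is only
    practical when that product is of a manageable size.
    """
    pairs = rdd_a.cartesian(rdd_b)
    hits = pairs.filter(lambda ab: matches(ab[0], ab[1]))
    return hits.map(lambda ab: ab[1])


def local_matching_bs(list_a, list_b):
    """Plain-Python nested loop, equivalent to the Spark version above."""
    result = []
    for a in list_a:
        for b in list_b:
            if matches(a, b):
                result.append(b)
    return result
```

Assuming an existing SparkContext named `sc`, this could be called as `matching_bs(sc.parallelize(A), sc.parallelize(B)).collect()`. A `b` that matches several elements of A appears once per match, the same as in the nested loop.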


On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman <daniel.imberman@gmail.com>
wrote:

> There's no real way of doing nested for-loops with RDDs because the whole
> idea is that you could have so much data in an RDD that it would be really
> ugly to store it all on one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like cogroup or join between the two
> RDDs. If the index matters, you can use zipWithIndex on both before you join
> and then see which indexes match up.
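
The two suggestions above could look roughly like this in PySpark. This is a sketch under stated assumptions, not a definitive implementation: elements are taken to be (JSON string, number) pairs, and `join_key` with its "category" field is a hypothetical key chosen for illustration. A key-based join only helps when matching elements are known to share some key. If they do not, the comparison remains a full N x M problem.

```python
import json


def join_key(element):
    """Hypothetical key: elements can only match when this value is equal.

    The "category" field is an assumption made for illustration.
    """
    doc = json.loads(element[0])
    return doc.get("category")


def join_on_key(rdd_a, rdd_b):
    """Pair up elements of the two RDDs that share the same key."""
    keyed_a = rdd_a.keyBy(join_key)   # (key, a)
    keyed_b = rdd_b.keyBy(join_key)   # (key, b)
    return keyed_a.join(keyed_b)      # (key, (a, b))


def join_on_position(rdd_a, rdd_b):
    """Pair up elements that sit at the same index in both RDDs."""
    # zipWithIndex yields (element, index); swap so the index is the key.
    indexed_a = rdd_a.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    indexed_b = rdd_b.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    return indexed_a.join(indexed_b)  # (index, (a, b))
```

Replacing `join` with `cogroup` in `join_on_key` would instead give, for each key, one group of elements from A and one group from B, which is closer to "for each a, collect the matching b's".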
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <eric@analyticsmd.com> wrote:
>
>> I've nested foreach loops like this:
>>
>>   for each index i of A do:
>>     for each index j of B do:
>>       append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>   some complex JSON,
>>   some number
>> )
>>
>> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
>> how would my code look?
>> Are there any RDD operators that would allow me to loop through both RDDs
>> like the above procedural code?
>> I can't find any RDD operators or code fragments that would allow me
>> to do this.
>>
>> Thing is: by the time I composed RDD(A), this RDD would have to contain
>> elements of array B as well as array A.
>> Same argument for RDD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>>
>> --
>>
>> -eric ho
>>
>>


-- 

-eric ho
