spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Strange behavior of RDD.cartesian
Date Sat, 29 Mar 2014 18:12:14 GMT
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug, fixed in 0.9.1, that might cause you to have too few
results in that join.
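
For reference, a minimal sketch of one way that setting could be applied when
building the context in a driver program. The app name and master URL are
placeholder values, not taken from this thread, and this is not a confirmed
fix for the issue described below.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master URL (assumptions for illustration only).
val conf = new SparkConf()
  .setAppName("cartesian-check")
  .setMaster("local[2]")
  // Disable shuffle spilling, the workaround suggested for 0.9.0.
  .set("spark.shuffle.spill", "false")

val sc = new SparkContext(conf)
```

The property has to be set before the SparkContext is created, since the
context reads its configuration at construction time.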

Sent from my mobile phone
On Mar 28, 2014 8:04 PM, "Matei Zaharia" <matei.zaharia@gmail.com> wrote:

> Weird, how exactly are you pulling out the sample? Do you have a small
> program that reproduces this?
>
> Matei
>
> On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
>
> I forgot to mention that I don't really use all of my data. Instead I use
> a sample extracted with randomSample.
>
>
> On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
>
>> Hi all,
>>
>> I've noticed that RDD.cartesian behaves strangely with cached and
>> uncached data. More precisely, I have a data set that I load with
>> objectFile:
>>
>>     val data: RDD[(Int, String, Array[Double])] = sc.objectFile("data")
>>
>> Then I split it into two sets depending on some criterion:
>>
>>
>>     val part1 = data.filter(_._2 matches "view1")
>>     val part2 = data.filter(_._2 matches "view2")
>>
>>
>> Finally, I compute the cartesian product of part1 and part2
>>
>>     val pair = part1.cartesian(part2)
>>
>>
>> If everything goes well, I should have
>>
>>     pair.count == part1.count * part2.count
>>
>> But this is not the case if I don't cache part1 and part2.
>>
>> What am I missing? Is caching data mandatory in Spark?
>>
>> Cheers,
>>
>> Jaonary
>>
>>
>>
>>
>
>
