spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Strange behavior of RDD.cartesian
Date Sat, 29 Mar 2014 03:04:13 GMT
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces
this?
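
Something along these lines would be enough to go on. This is only a sketch
under assumptions, not your actual setup: the synthetic data stands in for
sc.objectFile("data"), and the local master, the sample fraction, the fixed
seed, and the names CartesianRepro and counts are all made up for
illustration. It runs the same pipeline with and without cache() and prints
the counts next to the expected product.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CartesianRepro {
  type Record = (Int, String, Array[Double])

  // Returns (part1.count, part2.count, pair.count) for a sampled data set.
  def counts(sc: SparkContext, cache: Boolean): (Long, Long, Long) = {
    // Synthetic stand-in for sc.objectFile("data"); labels alternate between the two views.
    val data: RDD[Record] = sc.parallelize(1 to 1000, 4).map { i =>
      (i, if (i % 2 == 0) "view1" else "view2", Array(i.toDouble))
    }
    // Assumed sampling step: without replacement, fraction 0.1, fixed seed.
    val sampled = data.sample(false, 0.1, 42)
    val part1 = sampled.filter(_._2 matches "view1")
    val part2 = sampled.filter(_._2 matches "view2")
    if (cache) { part1.cache(); part2.cache() }
    val pair = part1.cartesian(part2)
    (part1.count(), part2.count(), pair.count())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CartesianRepro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Run the identical pipeline uncached, then cached, and compare.
      for (cache <- Seq(false, true)) {
        val (n1, n2, nPair) = counts(sc, cache)
        println(s"cache=$cache part1=$n1 part2=$n2 pair=$nPair expected=${n1 * n2}")
      }
    } finally sc.stop()
  }
}
```

If the uncached run prints a pair count that differs from the expected
product while the cached run matches, that narrows the problem down to how
the sample is being recomputed.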

Matei

On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:

> I forgot to mention that I don't really use all of my data. Instead I use a sample extracted
> with randomSample.
> 
> 
> On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
> Hi all,
> 
> I noticed that RDD.cartesian behaves strangely depending on whether the data is cached. More
> precisely, I have a data set that I load with objectFile
> 
> val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")
> 
> Then I split it into two sets depending on some criteria
> 
> 
> val part1 = data.filter(_._2 matches "view1")
> val part2 = data.filter(_._2 matches "view2")
> 
> 
> Finally, I compute the cartesian product of part1 and part2
> 
> val pair = part1.cartesian(part2)
> 
> 
> If everything goes well, I should have
> 
> pair.count == part1.count * part2.count
> 
> But this is not the case if I don't cache part1 and part2.
> 
> What am I missing? Is caching data mandatory in Spark?
> 
> Cheers,
> 
> Jaonary

