Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this?

Matei

On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:

I forgot to mention that I don't really use all of my data. Instead, I use a sample extracted with randomSample.
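
Roughly, it looks like this (a sketch, not my exact code: whether the sample is taken before or after the split, the fraction, the seed, and the use of RDD.sample in place of randomSample are all placeholders):

// Sample a fraction of the data. Without cache(), any action that touches
// an RDD derived from this one recomputes the sampling step from the lineage.
val sampled = data.sample(false, 0.1, 42)
val part1 = sampled.filter(_._2 matches "view1")
val part2 = sampled.filter(_._2 matches "view2")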


On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
Hi all,

I've noticed that RDD.cartesian behaves strangely with cached versus uncached data. More precisely, I have a set of data that I load with objectFile:

val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")

Then I split it into two sets depending on some criterion:


val part1 = data.filter(_._2 matches "view1")
val part2 = data.filter(_._2 matches "view2")


Finally, I compute the Cartesian product of part1 and part2:

val pair = part1.cartesian(part2)


If everything goes well, I should have

pair.count == part1.count * part2.count

But this is not the case if I don't cache part1 and part2.
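
With caching added, the version that gives the expected count looks like this (a sketch; calling cache() on the two parts before the Cartesian product is the only change):

// Cache part1 and part2 so the cartesian product and the separate counts
// reuse the same materialized data instead of recomputing the lineage.
part1.cache()
part2.cache()
val pair = part1.cartesian(part2)
assert(pair.count == part1.count * part2.count)  // this is what I observe once the parts are cached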

What am I missing? Is caching data mandatory in Spark?

Cheers,

Jaonary