spark-user mailing list archives

From mrm <>
Subject Getting different answers running same line of code
Date Thu, 19 Jun 2014 15:54:35 GMT

For some time now I have been getting different answers when I run the same
line of code twice. I ran some experiments to see what is happening; the code
and the results I get are below. I suspect the problem is in reading large
datasets from S3. Please help me!

rd1 = sc.textFile('s3n://blabla')
rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))

Note: both filter1() and map1() are deterministic.

rd2.count()  ==> 294928559
rd2.count()  ==> 294928559

So far so good: I get the same count both times. But once I unpersist rd1,
the problems start!

rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))
rd2.count()  ==> 294928559
rd2.count()  ==> 294509501
rd2.count()  ==> 294679795
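
One way to narrow this down might be to repeat each action several times and
tally the results, first on the raw S3 read and then on the filtered and mapped
RDD. The helper below is only a sketch: tally_results and is_stable are
hypothetical names, not part of Spark, and they work on any zero-argument
callable.

```python
from collections import Counter


def tally_results(action, runs=5):
    """Call a zero-argument action `runs` times and tally the values it returns."""
    return Counter(action() for _ in range(runs))


def is_stable(action, runs=5):
    """Return True if every run of the action produced the same value."""
    return len(tally_results(action, runs)) == 1


# Assumed usage in the PySpark shell from this post:
#   tally_results(rd1.count)   # does the raw S3 read already vary?
#   tally_results(rd2.count)   # or only the filtered and mapped RDD?
```

If the count of rd1 itself already varies between runs, that would point at
the S3 read rather than at filter1() or map1().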

I would appreciate it if you could help me!

