No. It is a filter that splits a line in a JSON file and extracts a position from it. Every run gives the same result.

That's what bothers me about this.


On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 wrote:
Is there any non-deterministic code in the filter, such as Random.nextInt()? If so, the program is no longer idempotent. You should specify a seed for it.
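
To illustrate the seeding point, here is a minimal sketch in plain Python (not Spark code, and not the filter from this thread). The function name sample_filter, the keep_fraction parameter, and the sample records are all hypothetical. It only shows that a filter driven by a seeded generator returns the same result on every run over the same input:

```python
import random

def sample_filter(lines, keep_fraction, seed):
    """Keep roughly keep_fraction of the lines, reproducibly.

    Hypothetical helper for illustration: a private generator created
    from a fixed seed produces the same sequence of draws on every run.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_fraction]

# Made-up input standing in for lines read from a file.
lines = [f"record-{i}" for i in range(1000)]

first = sample_filter(lines, 0.5, seed=42)
second = sample_filter(lines, 0.5, seed=42)

# Same seed and same input order give the same output.
print(first == second)
```

This only holds if the input arrives in the same order each time. In a distributed job the order of records may differ between runs, so a fixed seed alone may not be enough there.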

2014/1/24 Ognen Duzlevski

(Sorry for the sensationalist title) :)

If I run Spark on files from S3 and do a basic transformation like:


I get one number (e.g. 40,000).

If I do the same on the same files from HDFS, the number spat out is completely different (VERY different - something like 13,000).

What would one do in a situation like this? How do I even go about figuring out what the problem is? This is run on a cluster of 15 instances on Amazon.


Best Regards
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China

"The secret of great fortunes without apparent cause is a forgotten crime, because it was properly done" - Honore de Balzac