spark-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Non-deterministic behavior in spark
Date Fri, 24 Jan 2014 10:46:07 GMT
Hello,

(Sorry for the sensationalist title) :)

If I run Spark on files from S3 and apply a few basic transformations like:

textFile()
filter
groupByKey
count

I get one number (e.g. 40,000).

If I do the same on the same files from HDFS, the number spat out is
completely different (VERY different - something like 13,000).

What would one do in a situation like this? How do I even go about figuring
out what the problem is? This is run on a cluster of 15 instances on Amazon.
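
One way to start narrowing this down is to count records after every stage for each source and see where the two runs first disagree. The sketch below is plain Python rather than Spark, and everything in it is an assumption made for illustration: the helper names stage_counts and first_divergence, the toy input lines, and the keep and key_of functions standing in for the real filter predicate and key extraction. It relies only on the fact that counting after groupByKey gives the number of distinct keys.

```python
from collections import defaultdict

def stage_counts(lines, keep, key_of):
    """Count records after each stage of a read/filter/group-by-key pipeline."""
    lines = list(lines)
    kept = [ln for ln in lines if keep(ln)]       # the filter stage
    groups = defaultdict(list)
    for ln in kept:                               # the group-by-key stage
        groups[key_of(ln)].append(ln)
    # Counting the grouped result yields the number of distinct keys.
    return {"input": len(lines), "filtered": len(kept), "keys": len(groups)}

def first_divergence(a, b):
    """Return the name of the earliest stage whose counts differ, or None."""
    for stage in ("input", "filtered", "keys"):
        if a[stage] != b[stage]:
            return stage
    return None

# Hypothetical stand-ins for the real filter predicate and key extraction.
keep = lambda ln: "," in ln
key_of = lambda ln: ln.split(",")[0]

# Toy data standing in for the lines read from each storage backend.
s3_lines = ["a,1", "b,2", "a,3", "c,4"]
hdfs_lines = ["a,1", "b,2"]

s3_stats = stage_counts(s3_lines, keep, key_of)
hdfs_stats = stage_counts(hdfs_lines, keep, key_of)
print(s3_stats, hdfs_stats, first_divergence(s3_stats, hdfs_stats))
```

If the counts already differ at the input stage, the two sources are not delivering the same lines, and the filter and grouping steps are probably not at fault.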

Thanks,
Ognen
