Ah okay, kind of weird that it worked with a small file. Maybe that was being done locally since the file was small.

If you do run into further issues with S3, one other idea is to build Spark against a newer version of the Hadoop client library (Spark uses Hadoop’s data source classes to read data, so its S3 support comes from that library). You can do this by rebuilding Spark with:

SPARK_HADOOP_VERSION=2.2.0 sbt/sbt clean assembly


On Jan 21, 2014, at 3:04 AM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:

On Mon, Jan 20, 2014 at 11:05 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:

Thanks. I will try that, but your assumption is that something is failing in an obvious way, with a message. From the look of the spark-shell - just frozen - I would say something is "stuck". Will report back.

Given the suspicious nature of the "freezing" of the shell, it looked to me like a timeout or some kind of "wait".

I whipped out tcpdump on a node in the cluster and noticed that the nodes try to connect back to the master on some (random?) port. I realized that my VPC security group was too restrictive. As soon as I allowed all tcp and udp traffic within the VPC, it magically worked ;)
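For anyone hitting the same thing, the fix might look something like this with the AWS CLI (the security group ID and the 10.0.0.0/16 CIDR below are placeholders - substitute your own group and your VPC's address range):

```shell
# Allow all TCP and UDP traffic between nodes inside the VPC.
# sg-xxxxxxxx and 10.0.0.0/16 are placeholders for your own
# security group ID and VPC CIDR block.
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
    --protocol tcp --port 0-65535 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
    --protocol udp --port 0-65535 --cidr 10.0.0.0/16
```

Spark picks ephemeral ports for the executors to talk back to the driver, which is why opening only the well-known ports is not enough.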

So, problem solved. It is not a bug after all, just traffic being blocked.

In any case, I am documenting this as I go. As soon as I have a viable "data pipeline" in the VPC I will publish a write-up for everyone to read; I figure another account of the experience wouldn't hurt.