It’s true that the documentation partly targets Hadoop users, and that’s something we need to fix. Perhaps the best solution would be some kind of tutorial on “here’s how to set up Spark by hand on EC2”. However, it also sounds like you ran into some issues with S3 that would be good to report separately.
To answer the specific questions:
> For example, the thing supports using S3 to get files but when you actually try to read a large file, it just sits there and sits there and eventually comes back with an error that really does not tell me anything (so the task was killed - why? there is nothing in the logs). So, do I actually need an HDFS setup over S3 so it can support block access? Who knows, I can't find anything.

This sounds like either a bug or the S3 library somehow requiring a lot of memory to read a block. There isn’t a separate way to run HDFS over S3. Hadoop just has different implementations of “file systems”, one of which is S3. There’s a pointer to this at the bottom of http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3, but it is indeed pretty hidden in the docs.
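For what it’s worth, the URL format that docs page describes embeds the AWS credentials and bucket directly in the path you hand to Spark (you can also put the keys in the Hadoop configuration instead). The snippet below is just plain Python pulling such a URL apart to show the pieces; the bucket name and keys are made up:

```python
from urllib.parse import urlparse

# Hypothetical bucket and credentials, for illustration only.
# Hadoop's S3 file system (and therefore Spark) accepts URLs of the form
#   s3n://<access-key>:<secret-key>@<bucket>/<path>
url = "s3n://MYACCESSKEY:MYSECRETKEY@my-bucket/logs/2013-10-01.txt"

parsed = urlparse(url)
scheme = parsed.scheme      # "s3n" -- selects Hadoop's S3 file system
bucket = parsed.hostname    # "my-bucket"
key = parsed.path           # "/logs/2013-10-01.txt" -- the object key
access_key = parsed.username
print(scheme, bucket, key, access_key)
```

You would pass the whole URL straight to something like sc.textFile; the parsing above just shows which part Hadoop treats as what.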
> Even basic questions I have to ask on this list - does Spark support parallel reads from files in a shared filesystem? Someone answered - yes. Does this extend to S3? Who knows? Nowhere to be found. Does it extend to S3 only if used through HDFS? Who knows.

Everything in Hadoop and Spark is read in parallel, including S3.
> Does Spark need a running Hadoop cluster to realize its full potential? Who knows, it is not stated explicitly anywhere but any time I google stuff people mention Hadoop.

Not unless you want to use HDFS.
Anyway, these are really good questions, since as I said the docs currently target a Hadoop audience. We can improve this both in the online docs and by adding some kind of walk-through or tutorial. Do you have any suggestions on how you’d like the docs structured to cover this? E.g., should there be a separate section on S3, or on different input sources?
One final thing — as someone mentioned, using Spark’s EC2 scripts to launch a cluster is not a bad idea. We’ve supported those scripts pretty much since Spark was released and they do a lot of the configuration for you. You can even pause/restart the cluster if you want, etc.
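For anyone reading along, launching with those scripts looks something like the following. The cluster name, key pair name, and size here are placeholders, and the exact flags may differ between versions, so check `spark-ec2 --help` for your release:

```shell
# Launch a cluster named "my-cluster" with 2 slaves; -k/-i are your
# EC2 key pair name and the matching private key file (placeholders here).
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 launch my-cluster

# Pause the cluster when you're not using it, and bring it back later.
./spark-ec2 stop my-cluster
./spark-ec2 -k my-keypair -i ~/my-keypair.pem start my-cluster

# Tear everything down when you're done.
./spark-ec2 destroy my-cluster
```

The launch step installs and configures Spark (and HDFS) on the instances for you, which sidesteps most of the by-hand setup discussed above.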