Hi Matei, thanks for replying!

On Mon, Jan 20, 2014 at 8:08 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
It’s true that the documentation is partly targeting Hadoop users, and that’s something we need to fix. Perhaps the best solution would be some kind of tutorial on “here’s how to set up Spark by hand on EC2”. However it also sounds like you ran into some issues with S3 that it would be good to report separately.

To answer the specific questions:

> For example, the thing supports using S3 to get files but when you actually try to read a large file, it just sits there and sits there and eventually comes back with an error that really does not tell me anything (so the task was killed - why? there is nothing in the logs). So, do I actually need an HDFS setup over S3 so it can support block access? Who knows, I can't find anything.

This sounds like either a bug or somehow the S3 library requiring lots of memory to read a block. There isn’t a separate way to run HDFS over S3. Hadoop just has different implementations of “file systems”, one of which is S3. There’s a pointer to these versions at the bottom of http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3 but it is indeed pretty hidden in the docs.
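[Editor's aside: the S3 setup the linked page describes boils down to handing Hadoop your AWS credentials. A minimal sketch, using the standard Hadoop property names for the s3n filesystem; the values are placeholders:]

```xml
<!-- core-site.xml on each node (the same properties can also be set
     programmatically on the SparkContext's hadoopConfiguration) -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```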

Hmmm, maybe a bug then. If I read a small 600-byte file via an s3n:// URI, it works on a Spark cluster. If I try a 20GB file, it just sits there, frozen. Is there anything I can do to instrument this and figure out what is going on?
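[Editor's aside: one low-effort way to instrument a hang like this (a suggestion using standard Hadoop/log4j knobs, not something verified against this cluster) is to turn on DEBUG logging for Hadoop's S3 classes on the workers, rerun the count, and watch the executor logs. In conf/log4j.properties on each worker:]

```
# Verbose output from the native S3 filesystem and from jets3t,
# the S3 client library Hadoop uses underneath it
log4j.logger.org.apache.hadoop.fs.s3native=DEBUG
log4j.logger.org.jets3t=DEBUG
```

A thread dump of a stuck executor (jstack <pid>) taken while the job is frozen would also show whether the threads are waiting inside the S3 client.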


> Even basic questions I have to ask on this list - does Spark support parallel reads from files in a shared filesystem? Someone answered - yes. Does this extend to S3? Who knows? Nowhere to be found. Does it extend to S3 only if used through HDFS? Who knows.

Everything in Hadoop and Spark is read in parallel, including S3.

OK good to know!
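[Editor's aside: to make "parallel" concrete, here is an illustration, not something from the thread. An input file is divided into splits of roughly the filesystem's block size, and each split becomes one task that can be read concurrently, so even a single large file yields plenty of parallelism:]

```python
# Rough illustration: how many read tasks a single large input file
# produces, assuming one task per ~64 MB split (an old Hadoop default;
# the real split size depends on the filesystem and its configuration).
def num_splits(file_size_bytes, split_size_bytes=64 * 1024 * 1024):
    # Ceiling division: every byte belongs to exactly one split.
    return -(-file_size_bytes // split_size_bytes)

print(num_splits(20 * 1024 ** 3))  # 20 GB file -> 320 potential parallel tasks
print(num_splits(600))             # tiny file  -> 1 task
```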

> Does Spark need a running Hadoop cluster to realize its full potential? Who knows, it is not stated explicitly anywhere but any time I google stuff people mention Hadoop.

Not unless you want to use HDFS.

Ahh, OK. I don't particularly want HDFS, but I suspect I will need it since it seems to be the only "free" distributed parallel FS. I suspect running it over EBS volumes is as slow as molasses, though. Right now the s3:// freezing bug is a showstopper for me, and I am considering pooling the ephemeral storage on all the nodes in the Spark cluster into some kind of distributed file system like GPFS or Lustre or https://code.google.com/p/mogilefs/ to provide a "shared" file system for all the nodes. It is next to impossible to find online what the standard practices in the industry are for this kind of setup, so I guess I am going to set my own industry standards ;)

Anyway, these are really good questions as I said, since the docs kind of target a Hadoop audience. We can improve these both in the online docs and by having some kind of walk-throughs or tutorials. Do you have any suggestions on how you’d like the docs structured to show this stuff? E.g. should there be a separate section on S3, or on different input sources?

Not sure. For starters, it would be nice to document the real use cases.

I am more than happy (and I think the people I work for are happy too) to document the pipeline I am setting up. In the process I have found that the industry is remarkably tight-lipped about how to do these things in practice. For example, what if you want to expose a point on the internet where you can send millions of data points into your "firehose"? What do you use? How? I have people recommending Kafka, but even those people don't exactly say HOW.

I have gone the route of elastic load balancing with autoscaling, exposing a bunch of Mongrel2 instances running ZeroMQ handlers that ingest data and then bounce it into S3 for persistence and into a Spark cluster for real-time analytics, but also for after-the-fact analytics. While I have demonstrated the whole pipeline on a toy example, I am now trying to test it in "real life" with historic data that we have from our "previous" data provider - about 1-2 TB of data so far, in 20-30GB files. Unfortunately I have not been able to get past the basic f = textFile(s3://something), f.count test on a 20GB file on Amazon S3. I have a test cluster of about 16 m1.xlarge instances that is just sitting there spinning :)

One final thing — as someone mentioned, using Spark’s EC2 scripts to launch a cluster is not a bad idea. We’ve supported those scripts pretty much since Spark was released and they do a lot of the configuration for you. You can even pause/restart the cluster if you want, etc.
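[Editor's aside: for anyone following along, the launch step with the bundled scripts looks roughly like this. The key pair name, identity file, slave count, and cluster name are placeholders; the full option list is on the ec2-scripts page linked earlier.]

```shell
# From the ec2/ directory of a Spark checkout
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 16 launch my-cluster

# Pause the cluster and resume it later
./spark-ec2 stop my-cluster
./spark-ec2 -k my-keypair -i ~/my-keypair.pem start my-cluster
```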

Yes, but things get complicated in people's setups. I run mine in a VPC that exposes only one point of entry - the elastic load balancer that takes traffic from the "outside" and sends it to the "inside" of the VPC, where the analytics/Spark live. I imagine this would be a common scenario for a company that has millions of devices hitting its data entry point(s) and where the data is privacy-sensitive, for example - the VPC offers much more than EC2 with security groups (and is easier to manage).