spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Quality of documentation (rant)
Date Mon, 20 Jan 2014 20:08:08 GMT
Hi Ognen,

It’s true that the documentation partly targets Hadoop users, and that’s something
we need to fix. Perhaps the best solution would be some kind of tutorial on “here’s how
to set up Spark by hand on EC2”. However, it also sounds like you ran into some issues with
S3 that would be good to report separately.

To answer the specific questions:

> For example, the thing supports using S3 to get files but when you actually try to read
> a large file, it just sits there and sits there and eventually comes back with an error that
> really does not tell me anything (so the task was killed - why? there is nothing in the logs).
> So, do I actually need an HDFS setup over S3 so it can support block access? Who knows, I
> can't find anything.

This sounds like either a bug or the S3 library somehow requiring a lot of memory to read a
block. There isn’t a separate way to run HDFS over S3. Hadoop just has different implementations
of its “file system” interface, one of which is S3. There’s a pointer to this at the bottom
of http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3, but
it is indeed pretty hidden in the docs.
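
For what it’s worth, here’s a minimal sketch of reading a file straight from S3 (the bucket
and path are made up, and I’m assuming your AWS keys are in environment variables):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "S3Read")
    // Hand your credentials to Hadoop's native S3 filesystem (s3n)
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    // An s3n:// URI reads directly through that filesystem -- no HDFS involved
    val lines = sc.textFile("s3n://my-bucket/path/to/file.txt")
    println(lines.count())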

> Even basic questions I have to ask on this list - does Spark support parallel reads from
> files in a shared filesystem? Someone answered - yes. Does this extend to S3? Who knows? Nowhere
> to be found. Does it extend to S3 only if used through HDFS? Who knows.

Everything in Hadoop and Spark is read in parallel, including S3.
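
For example (again a sketch with a made-up path), you can even pass a hint for the minimum
number of splits, and each split becomes its own read task:

    // Ask for at least 64 splits; each partition is read by a separate task
    val data = sc.textFile("s3n://my-bucket/big-file.txt", 64)
    println(data.partitions.size)  // number of parallel read tasks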

> Does Spark need a running Hadoop cluster to realize its full potential? Who knows, it
> is not stated explicitly anywhere but any time I google stuff people mention Hadoop.

Not unless you want to use HDFS.

> Can Spark do EVERYTHING in standalone mode? The documentation is not explicit but it
> leads you to believe it can (or maybe I am overly optimistic?).

Yes, there’s no difference in what you can run on Spark across the different deployment modes.
They’re just different ways to get tasks onto a cluster.
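
To make that concrete (a sketch with made-up host names), the application code is identical
and only the master URL passed to SparkContext changes:

    val sc = new SparkContext("local[4]", "MyApp")             // one machine, 4 threads
    // val sc = new SparkContext("spark://host:7077", "MyApp") // standalone cluster
    // val sc = new SparkContext("mesos://host:5050", "MyApp") // Mesos cluster
    val count = sc.textFile("/data/input.txt").count()         // same program either way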

Anyway, these are really good questions, as I said, since the docs kind of target a Hadoop
audience. We can improve this both in the online docs and by adding some kind of walk-through
or tutorial. Do you have any suggestions on how you’d like the docs structured to cover this?
E.g., should there be a separate section on S3, or one on different input sources?

One final thing — as someone mentioned, using Spark’s EC2 scripts to launch a cluster
is not a bad idea. We’ve supported those scripts pretty much since Spark was released, and
they do a lot of the configuration for you. You can even pause and restart the cluster if
you want.
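
For example, from the ec2/ directory of the Spark distribution (assuming a keypair called
"my-key"; adjust the names to your setup):

    # Launch a cluster with 2 slaves
    ./spark-ec2 -k my-key -i ~/my-key.pem -s 2 launch my-cluster
    # Pause it when you're not using it, then bring it back later
    ./spark-ec2 stop my-cluster
    ./spark-ec2 -k my-key -i ~/my-key.pem start my-cluster
    # Tear it down for good
    ./spark-ec2 destroy my-cluster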

Matei