spark-user mailing list archives

From Aureliano Buendia <buendia...@gmail.com>
Subject Re: Quality of documentation (rant)
Date Wed, 22 Jan 2014 22:36:25 GMT
I have to second this.

The Spark documentation makes a lot of non-obvious assumptions. On top of
this, when you ask a question on the mailing list, the developers often
refer you back to that same documentation.


On Sun, Jan 19, 2014 at 12:52 PM, Ognen Duzlevski
<ognen@nengoiksvelzud.com> wrote:

> Hello,
>
> I have been trying to set up a running Spark cluster for a while now.
> Being new to all this, I have tried to rely on the documentation; however,
> I find it sorely lacking on a few fronts.
>
> For example, I think it has a number of built-in assumptions about a
> person's knowledge of Hadoop or Mesos. I have been using and programming
> computers for almost two decades, so I don't think I am a total idiot when
> it comes to these things; however, I am left staring at the console
> wondering what the hell is going on.
>
> As another example, Spark supports using S3 to get files, but when you
> actually try to read a large file, it just sits there and sits there and
> eventually comes back with an error that really does not tell me anything
> (so the task was killed - why? There is nothing in the logs). So, do I
> actually need an HDFS setup over S3 so it can support block access? Who
> knows, I can't find anything.
>
> Even basic questions I have to ask on this list - does Spark support
> parallel reads from files in a shared filesystem? Someone answered - yes.
> Does this extend to S3? Who knows? Nowhere to be found. Does it extend to
> S3 only if used through HDFS? Who knows.
>
> Does Spark need a running Hadoop cluster to realize its full potential?
> Who knows, it is not stated explicitly anywhere but any time I google stuff
> people mention Hadoop.
>
> Can Spark do EVERYTHING in standalone mode? The documentation is not
> explicit but it leads you to believe it can (or maybe I am overly
> optimistic?).
>
> So what does one do when they have a problem? How do they instrument stuff?
>
> I do not want to just rant - I am willing to put work into writing proper
> documentation for something that is advertised to work but in practice ends
> up costing you weeks of hunting for crap left and right and feeling lost. I
> am going through this process and would be happy to document a whole story
> of setting up a data analysis pipeline from aggregating data via HTTPS
> exposed over an ELB to sending it to a Spark cluster via ZeroMQ collectors
> to actual Spark cluster setup to.... - is anyone willing to help answer my
> questions so we can all benefit from this "hair-greying" experience? ;)
>
> Thanks!
> Ognen
>
>
>
