spark-user mailing list archives

From Ognen Duzlevski
Subject Quality of documentation (rant)
Date Sun, 19 Jan 2014 12:52:29 GMT

I have been trying to set up a running Spark cluster for a while now. Being
new to all this, I have tried to rely on the documentation; however, I find
it sorely lacking on a few fronts.

For example, I think it has a number of built-in assumptions about a
person's knowledge of Hadoop or Mesos. I have been using and programming
computers for almost two decades, so I don't think I am a total idiot when
it comes to these things; however, I am left staring at the console
wondering what the hell is going on.

For example, the thing supports using S3 to get files, but when you actually
try to read a large file, it just sits there and sits there and eventually
comes back with an error that really does not tell me anything (so the task
was killed - why? There is nothing in the logs). So, do I actually need an
HDFS setup over S3 so it can support block access? Who knows, I can't find

Even basic questions I have to ask on this list - does Spark support
parallel reads from files in a shared filesystem? Someone answered - yes.
Does this extend to S3? Who knows? Nowhere to be found. Does it extend to
S3 only if used through HDFS? Who knows.

Does Spark need a running Hadoop cluster to realize its full potential? Who
knows; it is not stated explicitly anywhere, but any time I google stuff,
people mention Hadoop.

Can Spark do EVERYTHING in standalone mode? The documentation is not
explicit but it leads you to believe it can (or maybe I am overly

So what does one do when they have a problem? How do they instrument stuff?

I do not want to just rant - I am willing to put work into writing proper
documentation for something that is advertised to work but in practice ends
up costing you weeks of hunting for crap left and right and feeling lost. I
am going through this process and would be happy to document the whole story
of setting up a data analysis pipeline, from aggregating data via HTTPS
exposed over an ELB, to sending it to a Spark cluster via ZeroMQ collectors,
to the actual Spark cluster setup, to.... Is anyone willing to help answer my
questions so we can all benefit from this "hair-greying" experience? ;)

