spark-user mailing list archives

From Ankur Chauhan <achau...@brightcove.com>
Subject Re: Quality of documentation (rant)
Date Sun, 19 Jan 2014 13:25:09 GMT
Hi Ognen,

I am in the same boat as you are. I actually work on a project that does basically this exact
same thing, and I tried to do the aggregations with Spark, but I ended up trying again and again
to do simple things like reading from S3 and failing.
I have a little bit of knowledge about Spark, so I may be able to answer some basic questions
(I just gave up on Spark after wasting 2-3 weeks on it). Plus, reading through the code is
of little or no use, as there are mostly no comments anywhere relevant and the Scala syntax
is pretty unintuitive for me.

I would be more than happy to assist and work on documentation with you. Do you have any ideas
on how you want to go about it or some existing plans?

-- Ankur 

> On Jan 19, 2014, at 4:52, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
> 
> Hello,
> 
> I have been trying to set up a running spark cluster for a while now. Being new to all
> this, I have tried to rely on the documentation, however, I find it sorely lacking on a few
> fronts.
> 
> For example, I think it has a number of built-in assumptions about a person's knowledge
> of Hadoop or Mesos. I have been using and programming computers for almost two decades so
> I don't think I am a total idiot when it comes to these things, however, I am left with staring
> at the console wondering what the hell is going on.
> 
> For example, the thing supports using S3 to get files but when you actually try to read
> a large file, it just sits there and sits there and eventually comes back with an error that
> really does not tell me anything (so the task was killed - why? there is nothing in the logs).
> So, do I actually need an HDFS setup over S3 so it can support block access? Who knows, I
> can't find anything.
> 
> Even basic questions I have to ask on this list - does Spark support parallel reads from
> files in a shared filesystem? Someone answered - yes. Does this extend to S3? Who knows? Nowhere
> to be found. Does it extend to S3 only if used through HDFS? Who knows.
> 
> Does Spark need a running Hadoop cluster to realize its full potential? Who knows, it
> is not stated explicitly anywhere but any time I google stuff people mention Hadoop.
> 
> Can Spark do EVERYTHING in standalone mode? The documentation is not explicit but it
> leads you to believe it can (or maybe I am overly optimistic?).
> 
> So what does one do when they have a problem? How do they instrument stuff?
> 
> I do not want to just rant - I am willing to put work into writing proper documentation
> for something that is advertised to work but in practice ends up costing you weeks of hunting
> for crap left and right and feeling lost. I am going through this process and would be happy
> to document a whole story of setting up a data analysis pipeline from aggregating data via
> https exposed over an ELB to sending it to a spark cluster via zeromq collectors to actual
> Spark cluster setup to.... - is anyone willing to help answer my questions so we can all benefit
> from this "hair-greying" experience? ;)
> 
> Thanks!
> Ognen
