Mayur, Ankur:

Thanks for the feedback. My basic requirement is for all my stuff to run in the VPC - I have obligations such as client data protection and the VPC just makes sense.

My basic setup is an exposed 443 port via ELB. Behind this are instances in the VPC running mongrel2 - an HTTP server that speaks zeromq to background zeromq handlers (which can be written in any language). The handlers run in an autoscaling group and their primary task is twofold: 1) bounce all incoming data into persistent storage (like S3) and 2) bounce all the data into a Spark cluster. Spark data ingestion is set up via zeromq for streaming analytics, but we also have terabytes of historical data from the service we used in the past that we have to go through for analytics.
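
In case it helps make the handler side concrete, here is a minimal sketch of the twofold fan-out. This is only an illustration: the zeromq sockets and the S3 upload are stood in for by stdlib queues and a plain list (in the real thing these would be zeromq PULL/PUSH sockets plus a boto upload), and all names below are hypothetical.

```python
import json
import queue

# Stand-ins for the real transports; in production these would be a
# zeromq PULL socket (from mongrel2), a boto S3 upload, and a zeromq
# PUSH socket feeding the Spark streaming receiver.
incoming = queue.Queue()   # messages arriving from mongrel2
to_storage = []            # stand-in for the S3 bucket
to_spark = queue.Queue()   # stand-in for the PUSH socket to Spark

def handle(raw):
    """The twofold task: 1) persist the raw event, 2) forward it for streaming."""
    event = json.loads(raw)
    to_storage.append(raw)   # 1) bounce into persistent storage (S3)
    to_spark.put(event)      # 2) bounce into the Spark cluster
    return event

def drain():
    # In the real handler this would be a blocking recv loop on the socket.
    while not incoming.empty():
        handle(incoming.get())

# usage
incoming.put('{"user": 1, "action": "click"}')
drain()
```

The point of the split is that the raw bytes go to storage untouched (so "after the fact" analytics can always replay them), while the parsed event goes to the streaming side.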

My other requirement is to set everything up myself and understand it. For testing purposes my cluster has 15 xlarge instances, and I guess I will just set up a Hadoop cluster over these instances for the purpose of getting the benefits of HDFS. I would then set up HDFS over S3 with block storage.
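
From what I can tell (this is an assumption on my part, not something I have verified), Spark can also read s3n:// paths directly once the Hadoop S3 credentials are configured, without a full HDFS-over-S3 layer. Something like this in core-site.xml, with the key values being placeholders:

```xml
<!-- core-site.xml fragment: lets Hadoop/Spark resolve s3n:// URIs -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

Whether that performs acceptably on terabyte-scale files versus HDFS block storage is exactly the kind of thing I can't find documented.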

Does my last paragraph sound reasonable? Would I be losing anything in this approach?

Finally, I have been using computers since the Commodore 64 ;). I understand the need for abstraction - tasks are getting more complicated and we need more "tools on top of tools" approaches to be able to do these complicated tasks. However, I think this kind of approach should rest on exhaustive documentation and use-case presentations. I even bought the book recommended on the Spark website and it is truly a waste of money (sorry to say) - unfinished and very modest in its goals.

I am happy to document my story as (I hope ;)) a typical use case of someone trying to do streaming and "after the fact" analytics for millions of users/data points, still growing every day. So yeah, sign me up :)


On Sun, Jan 19, 2014 at 2:36 PM, Mayur Rustagi <> wrote:
You can run Spark independent of Hadoop - but that is a bit of a lie :) it is a bit confusing. The quickest route out of your problems would be to use CDH, if you have that choice and are starting a cluster. They bundle in Spark, assisted by Databricks. The ride should be a little smoother, at least from an install and pre-built-library perspective.

If you are free to start your own cluster, another gratifying experience is to use the spark-ec2 tools. They are pretty well built and you should have a cluster ready within an hour or so. The image is pretty well maintained and runs well.
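
Something like the following gets a cluster up (the keypair name, identity file, and slave count below are just examples - check the EC2 scripts page of the Spark docs for your version):

```
# launch a cluster with 5 slaves
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 launch my-spark-cluster
# ssh into the master once it is up
./spark-ec2 -k my-keypair -i ~/my-keypair.pem login my-spark-cluster
# tear everything down when you are done paying for it :)
./spark-ec2 destroy my-spark-cluster
```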

Hadoop has two parts: HDFS (storage) and YARN (processing). Spark has an independent processing engine, as well as a YARN-integrated processing engine, as well as SIMR (Spark In MapReduce, for Hadoop 1.0). All of these take files from HDFS, hence even when Spark doesn't need Hadoop's processing layer it relies on HDFS. The entire process is not really well documented and the lingo used is also not very clear: Application -> Stage -> Task, etc. There was a lot of good discussion in the Spark Summit videos. Lastly, I am trying to set up documentation steps and a distribution around Spark - you are free to contribute and help me out. Just drop me a mail.


On Sun, Jan 19, 2014 at 6:55 PM, Ankur Chauhan <> wrote:
Hi Ognen,

I am in the same boat as you. I actually work on a project that does basically this exact same thing, and I tried to do the aggregations with Spark, but it ended with me trying again and again to do simple things like reading from S3, and failing.
I have a little bit of knowledge about Spark so I may be able to answer some basic questions (I just gave up on Spark after wasting 2-3 weeks on it). Plus, reading through the code is of little or no use, as there are mostly no comments anywhere relevant and the Scala syntax is pretty unintuitive for me.

I would be more than happy to assist and work on documentation with you. Do you have any ideas on how you want to go about it or some existing plans?

-- Ankur

> On Jan 19, 2014, at 4:52, Ognen Duzlevski <> wrote:
> Hello,
> I have been trying to set up a running Spark cluster for a while now. Being new to all this, I have tried to rely on the documentation; however, I find it sorely lacking on a few fronts.
> For example, I think it makes a number of built-in assumptions about a person's knowledge of Hadoop or Mesos. I have been using and programming computers for almost two decades, so I don't think I am a total idiot when it comes to these things; however, I am left staring at the console wondering what the hell is going on.
> For example, the thing supports using S3 to get files, but when you actually try to read a large file, it just sits there and sits there and eventually comes back with an error that really does not tell me anything (so the task was killed - why? there is nothing in the logs). So, do I actually need an HDFS setup over S3 so it can support block access? Who knows - I can't find anything.
> Even basic questions I have to ask on this list - does Spark support parallel reads from files in a shared filesystem? Someone answered - yes. Does this extend to S3? Who knows? Nowhere to be found. Does it extend to S3 only if used through HDFS? Who knows.
> Does Spark need a running Hadoop cluster to realize its full potential? Who knows, it is not stated explicitly anywhere but any time I google stuff people mention Hadoop.
> Can Spark do EVERYTHING in standalone mode? The documentation is not explicit but it leads you to believe it can (or maybe I am overly optimistic?).
> So what does one do when they have a problem? How do they instrument stuff?
> I do not want to just rant - I am willing to put work into writing proper documentation for something that is advertised to work but in practice ends up costing you weeks of hunting for crap left and right and feeling lost. I am going through this process and would be happy to document the whole story of setting up a data analysis pipeline: from aggregating data via HTTPS exposed over an ELB, to sending it to a Spark cluster via zeromq collectors, to the actual Spark cluster setup, to.... Is anyone willing to help answer my questions so we can all benefit from this "hair-greying" experience? ;)
> Thanks!
> Ognen