spark-user mailing list archives

From Ognen Duzlevski <>
Subject Re: Running K-Means on a cluster setup
Date Wed, 22 Jan 2014 14:50:23 GMT

I have found that you generally need two separate pools of knowledge to be
successful in this game :). One is enough knowledge of network topologies,
systems, Java, Scala and whatever else to actually set up the whole system
(esp. if your requirements differ from running on a local machine or on the
EC2 cluster supported by the scripts that come with Spark).

The other is actual knowledge of the API and how it works and how to
express and solve your problems using the primitives offered by spark.

There is also a third: since you can supply any function to a spark
primitive, you generally need to know scala or java (or python?) to
actually solve your problem.

I am not sure this list is viewed as an appropriate place to offer advice on
how to actually solve these problems. Not that I would mind seeing various
solutions to various problems, and optimizations too :).

For example, I am trying to do rudimentary retention analysis. I am a total
beginner in the whole map/reduce way of solving problems. I have come up
with a solution that is pretty slow but implemented in 5 or 6 lines of code
for the simplest problem. However, my files are 20 GB in size each, all
JSON strings. Figuring out what the limiting factor is (my guess is network
bandwidth, since I am accessing things via S3) is still somewhat of black
magic to me at this point. I think for most of this stuff you
will have to read the code. The bigger question after that is optimizing
your solutions to be faster :). I would love to see practical tutorials on
doing such things and I am willing to put my attempts at solving problems
out there to eventually get cannibalized, ridiculed and reimplemented
properly :).
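
To make the map/reduce shape of such a retention computation concrete, here is a minimal sketch using plain Scala collections. It is not the 5-or-6-line solution mentioned above: the Event record, the integer day numbering and the retention function are all assumptions made for illustration, and parsing the JSON lines into events is left out.

```scala
// Hypothetical event record: which user was seen on which day.
// Real input would be parsed from the JSON lines; that step is omitted here.
case class Event(userId: String, day: Int)

object RetentionSketch {
  // Fraction of the users seen on `cohortDay` who are also seen on `laterDay`.
  def retention(events: Seq[Event], cohortDay: Int, laterDay: Int): Double = {
    // "map" step: project each event to a (day, user) pair and drop duplicates.
    val dayUser: Seq[(Int, String)] = events.map(e => (e.day, e.userId)).distinct

    // "group" step: collect the set of users seen on each day.
    val usersByDay: Map[Int, Set[String]] =
      dayUser.groupBy(_._1).map { case (day, pairs) => day -> pairs.map(_._2).toSet }

    val cohort   = usersByDay.getOrElse(cohortDay, Set.empty[String])
    val returned = usersByDay.getOrElse(laterDay, Set.empty[String]) intersect cohort

    // An empty cohort has no meaningful retention; report 0.0 in that case.
    if (cohort.isEmpty) 0.0 else returned.size.toDouble / cohort.size
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("alice", 0), Event("bob", 0), Event("carol", 0), Event("alice", 0),
      Event("alice", 7), Event("carol", 7), Event("dave", 7)
    )
    // alice and carol come back on day 7, out of the day-0 cohort {alice, bob, carol}.
    println(retention(events, cohortDay = 0, laterDay = 7))
  }
}
```

On a cluster the same steps would be written against an RDD instead of a Seq, using the corresponding operations such as map and distinct. Whether that is faster depends mostly on how quickly the input can be read, which is the bottleneck question raised above.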

Sorry for this long-winded email; it did not really answer your question
anyway :)

On Wed, Jan 22, 2014 at 2:35 PM, Kal El <> wrote:

> I have created a cluster setup with 2 workers (one of them is also the
> master)
> Can anyone help me with a tutorial on how to run K-Means for example on
> this cluster (it would be better to run it from outside the cluster command
> line)?
> I am mostly interested on how do I initiate the sparkcontext (what jars do
> I need to add ? :
> new SparkContext(master, appName, [sparkHome], [jars])) and what other
> steps I need to run.
> I am using the standalone spark cluster.
> Thanks
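
For reference, the constructor quoted in the question can be called roughly as in the sketch below. The host name, application name, Spark home and jar path are placeholders (assumptions, not values from this thread). The jars argument is meant for the jar containing your own compiled application classes, so that the workers can load them. The job is a trivial count rather than K-Means, just to check that the cluster accepts work.

```scala
import org.apache.spark.SparkContext

object ClusterDriver {
  // Thin wrapper around the four-argument constructor from the question.
  def connect(master: String, appName: String, sparkHome: String, jars: Seq[String]): SparkContext =
    new SparkContext(master, appName, sparkHome, jars)

  def main(args: Array[String]): Unit = {
    // All four values below are placeholders and must be adapted.
    val master    = "spark://master-host:7077" // standalone master URL; 7077 is the default port
    val appName   = "KMeansExample"
    val sparkHome = "/path/to/spark"           // Spark install location on the worker machines
    val jars      = Seq("target/my-app.jar")   // jar(s) with your own classes, shipped to the workers

    val sc = connect(master, appName, sparkHome, jars)
    try {
      // Trivial job to confirm that the workers accept tasks.
      val n = sc.parallelize(1 to 1000).map(_ * 2).count()
      println("counted " + n + " elements")
    } finally {
      sc.stop()
    }
  }
}
```

When the driver runs on a machine outside the cluster, that machine has to be able to reach the master, and the workers have to be able to connect back to it.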
