spark-user mailing list archives

From Mayur Rustagi <>
Subject Re: Running K-Means on a cluster setup
Date Wed, 22 Jan 2014 14:58:30 GMT
How about ?

Mayur Rustagi
Ph: +919632149971

On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski wrote:

> Hello,
> I have found that you generally need two separate pools of knowledge to be
> successful in this game :). One is enough knowledge of network
> topologies, systems, Java, Scala and whatever else to actually set up the
> whole system (esp. if your requirements differ from running on a
> local machine or in the EC2 cluster supported by the scripts that come with
> Spark).
> The other is actual knowledge of the API and how it works, and how to
> express and solve your problems using the primitives offered by Spark.
> There is also a third: since you can supply any function to a Spark
> primitive, you generally need to know Scala or Java (or Python?) to
> actually solve your problem.
> I am not sure this list is viewed as an appropriate place to offer advice
> on how to actually solve these problems. Not that I would mind seeing
> various solutions to various problems :) and also optimizations.
> For example, I am trying to do rudimentary retention analysis. I am a
> total beginner in the whole map/reduce way of solving problems. I have come
> up with a solution that is pretty slow but implemented in 5 or 6 lines of
> code for the simplest problem. However, my files are 20 GB in size each,
> all JSON strings. Figuring out the limiting factor (my suspicion is network
> bandwidth, since I am accessing the data via S3) is somewhat of a black
> art to me at this point. I think for most of this
> stuff you will have to read the code. The bigger question after that is
> optimizing your solutions to be faster :). I would love to see practical
> tutorials on doing such things and I am willing to put my attempts at
> solving problems out there to eventually get cannibalized, ridiculed and
> reimplemented properly :).
> Sorry for this long-winded email; it did not really answer your question
> anyway :)
> Ognen
> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <> wrote:
>> I have created a cluster setup with 2 workers (one of them is also the
>> master).
>> Can anyone help me with a tutorial on how to run K-Means, for example, on
>> this cluster (ideally running it from a command line outside the cluster)?
>> I am mostly interested in how to initialize the SparkContext (what jars
>> do I need to add?):
>> new SparkContext(master, appName, [sparkHome], [jars])
>> and what other steps I need to run.
>> I am using the standalone spark cluster.
>> Thanks
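For the SparkContext question quoted above, here is a minimal sketch against a standalone cluster, assuming Spark 0.8/0.9-era APIs. The master URL, HDFS path, sparkHome, and jar name are placeholders, not values from this thread; the application jar would be whatever `sbt package` produces for your project, and it is what gets shipped to the workers via the `jars` argument:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansOnCluster {
  def main(args: Array[String]): Unit = {
    // Connect to the standalone master; host and paths are placeholders.
    val sc = new SparkContext(
      "spark://master-host:7077",        // master URL of the standalone cluster
      "KMeansExample",                   // appName
      "/path/to/spark",                  // sparkHome on the workers
      Seq("target/kmeans-example.jar"))  // jars shipped to the workers

    // Each input line: whitespace-separated numeric features.
    val data = sc.textFile("hdfs://master-host:9000/data/points.txt")
      .map(_.split(' ').map(_.toDouble))
      .cache()

    // Train with k = 2 clusters, at most 20 iterations.
    val model = KMeans.train(data, 2, 20)
    model.clusterCenters.foreach(c => println(c.mkString(" ")))
    sc.stop()
  }
}
```

Running the object with the cluster's master URL from any machine that can reach the master and workers is what lets you launch "from outside the cluster", as long as the data path is visible to the workers too.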
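On the retention-analysis point raised above, the map/reduce shape of the simplest version can be sketched with plain Scala collections standing in for an RDD. The `"userId, date"` line format and the `parse`/`retained` helpers are made-up stand-ins for the JSON records mentioned in the thread; a real job would use a JSON parser and run the same shape on `sc.textFile(...)` output:

```scala
object RetentionSketch {
  // Hypothetical parser: pull (userId, date) out of one "userId, date" line.
  def parse(line: String): (String, String) = {
    val fields = line.split(",").map(_.trim)
    (fields(0), fields(1))
  }

  // Users active on `day` who were also active on `prevDay`.
  def retained(lines: Seq[String], prevDay: String, day: String): Set[String] = {
    // Group user ids by day, then intersect the two days' user sets.
    val usersByDay: Map[String, Set[String]] =
      lines.map(parse)
        .groupBy { case (_, date) => date }
        .map { case (date, pairs) => date -> pairs.map(_._1).toSet }
    usersByDay.getOrElse(prevDay, Set.empty) intersect
      usersByDay.getOrElse(day, Set.empty)
  }
}
```

The same groupBy-then-intersect structure translates directly to RDD operations, which is why a first solution fits in a handful of lines even on large inputs.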
