spark-user mailing list archives

From Ognen Duzlevski <>
Subject Re: Running K-Means on a cluster setup
Date Wed, 22 Jan 2014 15:05:38 GMT

On Wed, Jan 22, 2014 at 2:58 PM, Mayur Rustagi <> wrote:

> How about ?
> Regards
> Mayur
> Mayur Rustagi
> Ph: +919632149971
> http://
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <> wrote:
>> Hello,
>> I have found that you generally need two separate pools of knowledge to
>> be successful in this game :). One is to have enough knowledge of network
>> topologies, systems, java, scala and whatever else to actually set up the
>> whole system (esp. if your requirements are different than running on a
>> local machine or in the ec2 cluster supported by the scripts that come with
>> spark).
>> The other is actual knowledge of the API and how it works and how to
>> express and solve your problems using the primitives offered by spark.
>> There is also a third: since you can supply any function to a spark
>> primitive, you generally need to know scala or java (or python?) to
>> actually solve your problem.
>> I am not sure this list is viewed as an appropriate place to offer advice on
>> how to actually solve these problems. Not that I would mind seeing various
>> solutions to various problems :) and also optimizations.
>> For example, I am trying to do rudimentary retention analysis. I am a
>> total beginner in the whole map/reduce way of solving problems. I have come
>> up with a solution that is pretty slow but implemented in 5 or 6 lines of
>> code for the simplest problem. However, my files are 20 GB in size each,
>> all JSON strings. Figuring out what the limiting factor is (my guess is
>> network bandwidth, since I am accessing things via S3) is still somewhat of
>> a black art to me at this point. I think for most of this
>> stuff you will have to read the code. The bigger question after that is
>> optimizing your solutions to be faster :). I would love to see practical
>> tutorials on doing such things and I am willing to put my attempts at
>> solving problems out there to eventually get cannibalized, ridiculed and
>> reimplemented properly :).
>> Sorry for this long-winded email; it did not really answer your question
>> anyway :)
>> Ognen
>> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <> wrote:
>>> I have created a cluster setup with 2 workers (one of them is also the
>>> master).
>>> Can anyone help me with a tutorial on how to run K-Means, for example, on
>>> this cluster (it would be better to run it from a command line outside the
>>> cluster)?
>>> I am mostly interested in how to initialize the SparkContext (what jars
>>> do I need to add?):
>>> new SparkContext(master, appName, [sparkHome], [jars]), and what other
>>> steps I need to run.
>>> I am using the standalone spark cluster.
>>> Thanks
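
For the SparkContext question above, a minimal sketch of what such a driver program might look like follows. It is not a tested recipe for this particular cluster. It assumes Spark 1.0 or later with the RDD-based MLlib API (earlier releases may expect RDD[Array[Double]] rather than vectors), and the master URL, Spark home, jar path, input path, k and iteration count are all hypothetical placeholders. The jars argument is generally the jar holding your own application classes (plus any dependencies not already on the workers), so that Spark can ship them to the executors.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansOnCluster {
  // Parse one whitespace-separated line of numbers into an array of doubles.
  def parsePoint(line: String): Array[Double] =
    line.trim.split("\\s+").map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // Every value below is a placeholder (assumption), not taken from the thread.
    val master    = "spark://master-host:7077" // standalone master URL
    val sparkHome = "/opt/spark"               // Spark install dir on the workers
    val appJar    = "target/kmeans-app.jar"    // jar that contains this class
    val input     = "hdfs:///data/points.txt"  // must be readable from every worker

    // Same constructor shape as in the question: master, appName, sparkHome, jars.
    val sc = new SparkContext(master, "KMeansOnCluster", sparkHome, Seq(appJar))
    try {
      // One point per line; cache because K-Means iterates over the data.
      val points = sc.textFile(input)
        .map(line => Vectors.dense(parsePoint(line)))
        .cache()

      // k = 2 clusters, at most 20 iterations (arbitrary example values).
      val model = KMeans.train(points, 2, 20)
      model.clusterCenters.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Run from a machine outside the cluster, the driver needs network access to the master and workers, and they need to be able to connect back to it.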
