spark-user mailing list archives

From Kal El <pinu.datri...@yahoo.com>
Subject Re: Running K-Means on a cluster setup
Date Wed, 22 Jan 2014 22:02:14 GMT
Please understand that the code from your link is of no use to me. It's like someone
is trying to solve a differential equation and you give them the formula for the area
of a circle.

I can already do that with my own code (K-Means code). The point is that I want to run it on a cluster
...



On Wednesday, January 22, 2014 5:31 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
 
I am sorry, that is not a tutorial. You can take this source code:

https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java


Sync and build this project:
https://github.com/apache/incubator-spark/

You should be able to call the JavaKMeans class; Reynold may be able to shed some light on how
to use it.
If you get somewhere and then get stuck, post back and I can try to help. I hope this helps.

Regards
Mayur





Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:35 PM, Kal El <pinu.datriciu@yahoo.com> wrote:

@Mayur: I do not see any tutorial about how to run MLlib on a cluster, just a basic presentation
unrelated to actually running the algorithm.
>
>
>@Ognen: Thanks, I have figured that out :)) that's why I need some tutorials
>
>
>
>On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
> 
>How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
>Regards
>Mayur
>
>
>Mayur Rustagi
>Ph: +919632149971
>http://www.sigmoidanalytics.com
>https://twitter.com/mayur_rustagi
>
>
>
>
>On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
>
>Hello,
>>
>>I have found that you generally need two separate pools of knowledge to be successful
in this game :). One is enough knowledge of network topologies, systems, Java, Scala
and whatever else to actually set up the whole system (especially if your requirements differ
from running on a local machine or in the EC2 cluster supported by the scripts that come with
Spark).
>>
>>The other is actual knowledge of the API: how it works, and how to express and solve
your problems using the primitives offered by Spark.
>>
>>There is also a third: since you can supply any function to a Spark primitive, you
generally need to know Scala or Java (or Python?) to actually solve your problem.
>>
>>I am not sure this list is viewed as an appropriate place to offer advice on how to actually
solve these problems. Not that I would mind seeing various solutions to various problems :)
and also optimizations.
>>
>>For example, I am trying to do rudimentary retention analysis. I am a total beginner
in the whole map/reduce way of solving problems. I have come up with a solution that is pretty
slow but takes only 5 or 6 lines of code for the simplest problem. However, my files are
20 GB in size each, all JSON strings. Figuring out what the limiting factor is (network bandwidth
is my suspicion, since I am accessing things via S3) is somewhat of black magic
to me at this point. I think for most of this stuff you will have to read the code. The bigger
question after that is optimizing your solutions to be faster :). I would love to see practical
tutorials on doing such things, and I am willing to put my attempts at solving problems out
there to eventually get cannibalized, ridiculed and reimplemented properly :).
>>
>>Sorry for this long-winded email; it did not really answer your question anyway :)
>>
>>Ognen
>>
>>
>>
>>
>>On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pinu.datriciu@yahoo.com> wrote:
>>
>>I have created a cluster setup with 2 workers (one of them is also the master).
>>>
>>>
>>>Can anyone help me with a tutorial on how to run, for example, K-Means on this cluster
(it would be better to run it from a command line outside the cluster)?
>>>
>>>
>>>I am mostly interested in how to initialize the SparkContext (which jars do I need
to add?:
>>>new SparkContext(master, appName, [sparkHome], [jars])) and what other steps I need
to run.
>>>
>>>
>>>I am using the standalone Spark cluster.
>>>
>>>
>>>Thanks
>>>
>>>
>>>
>>>
>>
>
>
>
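The original question in this thread asks what to pass as master, appName, sparkHome and jars when creating the context for a standalone cluster. The following is only a minimal sketch of how those four values might be collected, written in plain Java with no Spark dependency so it can be compiled on its own. The class name SparkContextArgs, the host name master-host, the path /opt/spark and the jar path target/kmeans-app.jar are all hypothetical placeholders, not values from the thread. The spark://HOST:PORT form is the standalone master URL, and 7077 is the usual default port of a standalone master.

```java
import java.util.Arrays;

/**
 * Sketch only: holds the four arguments of
 *   new SparkContext(master, appName, [sparkHome], [jars])
 * for a standalone cluster. It does not create a SparkContext itself.
 */
final class SparkContextArgs {
    final String master;    // standalone master URL, of the form spark://HOST:PORT
    final String appName;   // application name shown in the cluster web UI
    final String sparkHome; // directory where Spark is installed on the cluster nodes
    final String[] jars;    // jars containing your own classes, shipped to the workers

    private SparkContextArgs(String master, String appName, String sparkHome, String[] jars) {
        this.master = master;
        this.appName = appName;
        this.sparkHome = sparkHome;
        this.jars = jars;
    }

    /** Builds the argument set for a standalone master reachable at masterHost:port. */
    static SparkContextArgs forStandalone(String masterHost, int port, String appName,
                                          String sparkHome, String... jars) {
        if (masterHost == null || masterHost.isEmpty()) {
            throw new IllegalArgumentException("master host is required");
        }
        String masterUrl = "spark://" + masterHost + ":" + port;
        // Copy the array so later changes by the caller do not affect this object.
        return new SparkContextArgs(masterUrl, appName, sparkHome, jars.clone());
    }

    @Override
    public String toString() {
        return "master=" + master + ", appName=" + appName
                + ", sparkHome=" + sparkHome + ", jars=" + Arrays.toString(jars);
    }
}
```

With Spark on the classpath, a value built as SparkContextArgs.forStandalone("master-host", 7077, "KMeansApp", "/opt/spark", "target/kmeans-app.jar") could then be handed to the Java API as new JavaSparkContext(a.master, a.appName, a.sparkHome, a.jars). The jars entry is the part that matters most when launching from outside the cluster, since the workers need your application's classes.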