spark-user mailing list archives

From Kal El <pinu.datri...@yahoo.com>
Subject Re: Running K-Means on a cluster setup
Date Thu, 23 Jan 2014 12:20:32 GMT
Ok, so I took a basic piece of code (one that shows the clock), packed everything into a .jar file, included the path to the jar file in the "ADD_JAR" environment variable, and launched a spark-shell on the cluster.

How do I run the code from the jar file from the console?
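A minimal sketch of what that might look like, assuming the jar contains an object com.example.Clock with a main method (both names are hypothetical); note that the 0.8/0.9 shell documented the variable as ADD_JARS (plural), so the name may need adjusting:

    // on the machine the shell is launched from; jar path and master URL are placeholders
    // ADD_JARS=/path/to/clock.jar MASTER=spark://<master-host>:7077 ./bin/spark-shell

    // once the shell is up, the jar is on the classpath, so the class can be imported and invoked
    scala> import com.example.Clock
    scala> Clock.main(Array.empty[String])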



On Thursday, January 23, 2014 12:12 AM, Ewen Cheslack-Postava <me@ewencp.org> wrote:
 
I think Mayur pointed to that code because it includes the relevant initialization code you were asking about. Running on a cluster doesn't require much change: pass the spark:// address of the master instead of "local" and add any jars containing your code. You could set the jars manually, but the linked code uses JavaSparkContext.jarOfClass(JavaKMeans.class) to get the right jar filename.

-Ewen
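A minimal Scala sketch of the initialization Ewen describes, with the master URL, application name, and jar path as placeholders (this uses the 0.8/0.9-era SparkContext constructor; later releases moved to SparkConf):

    import org.apache.spark.SparkContext

    // point at the standalone master instead of "local", and ship the jar
    // containing your classes so the executors can load them
    val jars = Seq("/path/to/my-kmeans.jar")   // or SparkContext.jarOfClass(getClass)
    val sc = new SparkContext(
      "spark://<master-host>:7077",            // master URL shown on the master's web UI
      "KMeansOnCluster",                       // application name
      System.getenv("SPARK_HOME"),             // Spark installation path on the workers
      jars)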


>Kal El
>January 22, 2014 2:02 PM
>Please understand that the code from your link is completely useless to me. It's like someone is trying to solve a differential equation and you tell them the formula for the area of a circle.
>
>
>I can do that with my code too (my k-means code). The idea is that I want to run it on a cluster ...
>
>
>
>On Wednesday, January 22, 2014 5:31 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
> 
>I am sorry, that is not a tutorial. You can take this source code:
>
>
>https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/mllib/examples/JavaKMeans.java
>
>
>
>Sync and build this project:
>https://github.com/apache/incubator-spark/
>
>You should be able to call the JavaKMeans class; Reynold may be able to shed some light on how to use it.
>If you reach somewhere and get stuck, post it back and I can try to help. I hope this helps.
>
>
>Regards
>Mayur
>
>Mayur Rustagi
>Ph: +919632149971
>http://www.sigmoidanalytics.com
>https://twitter.com/mayur_rustagi
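To make Mayur's pointer concrete, here is a rough Scala equivalent of what the linked JavaKMeans example does, written against the 0.9-era MLlib API (which took RDD[Array[Double]]; newer releases use Vector), with the input path and parameters as placeholders:

    import org.apache.spark.mllib.clustering.KMeans

    // sc is a SparkContext created against the cluster, as in the earlier snippet
    val data = sc.textFile("hdfs://<namenode>/path/to/points.txt")
                 .map(_.split(' ').map(_.toDouble))   // one space-separated point per line
                 .cache()

    val model = KMeans.train(data, 5, 20)             // k = 5 clusters, at most 20 iterations
    model.clusterCenters.foreach(c => println(c.mkString(" ")))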
>
>
>
>
>On Wed, Jan 22, 2014 at 8:35 PM, Kal El <pinu.datriciu@yahoo.com> wrote:
>
>Kal El
>January 22, 2014 7:05 AM
>@Mayur: I do not see any tutorial about how to run MLlib on a cluster, just a basic presentation not related to actually running the algorithm.
>
>
>@Ognen: Thanks, I have figured that out :)) that's why I need some tutorials.
>
>
>
>On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
> 
>How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
>Regards
>Mayur
>
>
>Mayur Rustagi
>Ph: +919632149971
>http://www.sigmoidanalytics.com
>https://twitter.com/mayur_rustagi
>
>
>
>
>On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
>
>Ognen Duzlevski
>January 22, 2014 6:50 AM
>Hello,
>
>I have found that you generally need two separate pools of knowledge to be successful in this game :). One is to have enough knowledge of network topologies, systems, Java, Scala and whatever else to actually set up the whole system (especially if your requirements are different from running on a local machine or in the EC2 cluster supported by the scripts that come with Spark).
>
>The other is actual knowledge of the API, how it works, and how to express and solve your problems using the primitives offered by Spark.
>
>There is also a third: since you can supply any function to a Spark primitive, you generally need to know Scala or Java (or Python?) to actually solve your problem.
>
>I am not sure this list is viewed as an appropriate place to offer advice on how to actually solve these problems. Not that I would mind seeing various solutions to various problems :) and also optimizations.
>
>For example, I am trying to do rudimentary retention analysis. I am a total beginner in the whole map/reduce way of solving problems. I have come up with a solution that is pretty slow but implemented in 5 or 6 lines of code for the simplest problem. However, my files are 20 GB in size each, all JSON strings. Figuring out what the limiting factor is (network bandwidth is my suspicion, since I am accessing things via S3) is somewhat of a black art to me at this point. I think for most of this stuff you will have to read the code. The bigger question after that is optimizing your solutions to be faster :). I would love to see practical tutorials on doing such things, and I am willing to put my attempts at solving problems out there to eventually get cannibalized, ridiculed and reimplemented properly :).
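Purely as an illustration of the kind of few-line attempt described here (not Ognen's actual code), a sketch in Scala with the S3 path, the JSON field names, and the definition of "retained" all invented for the example:

    import org.apache.spark.SparkContext._   // pair-RDD operations (auto-imported in the shell)

    val events = sc.textFile("s3n://<bucket>/events/*.json")

    // crude string extraction instead of a real JSON parser, just for the sketch
    val field = (json: String, key: String) =>
      ("\"" + key + "\":\"([^\"]*)\"").r.findFirstMatchIn(json).map(_.group(1)).getOrElse("")

    // (user, day) pairs, assuming ISO dates so string comparison orders them
    val userDays  = events.map(j => (field(j, "userId"), field(j, "date"))).distinct()
    val firstSeen = userDays.reduceByKey((a, b) => if (a < b) a else b)

    // a user counts as retained for their first-day cohort if they show up on any later day
    val retained = userDays.join(firstSeen)
                           .filter { case (_, (day, first)) => day > first }
                           .map { case (user, (_, first)) => (first, user) }
                           .distinct()
                           .countByKey()
    retained.foreach { case (cohort, n) => println(cohort + " -> " + n + " returning users") }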
>
>Sorry for this long-winded email; it did not really answer your question anyway :)
>
>Ognen