spark-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: Running K-Means on a cluster setup
Date Wed, 22 Jan 2014 15:05:38 GMT
Nice!


On Wed, Jan 22, 2014 at 2:58 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:

> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com> wrote:
>
>> Hello,
>>
>> I have found that you generally need two separate pools of knowledge to
>> be successful in this game :). One is to have enough knowledge of network
>> topologies, systems, java, scala and whatever else to actually set up the
>> whole system (esp. if your requirements are different than running on a
>> local machine or in the ec2 cluster supported by the scripts that come with
>> spark).
>>
>> The other is actual knowledge of the API and how it works and how to
>> express and solve your problems using the primitives offered by spark.
>>
>> There is also a third: since you can supply any function to a spark
>> primitive, you generally need to know scala or java (or python?) to
>> actually solve your problem.
>>
>> I am not sure this list is viewed as an appropriate place to offer advice
>> on how to actually solve these problems. Not that I would mind seeing
>> various solutions to various problems, and optimizations too :).
>>
>> For example, I am trying to do rudimentary retention analysis. I am a
>> total beginner in the whole map/reduce way of solving problems. I have come
>> up with a solution that is pretty slow but implemented in 5 or 6 lines of
>> code for the simplest problem. However, my files are 20 GB in size each,
>> all JSON strings. Figuring out what the limiting factor is (my guess is
>> network bandwidth, since I am accessing things via S3) is still somewhat
>> of black magic to me at this point. I think for most of this
>> stuff you will have to read the code. The bigger question after that is
>> optimizing your solutions to be faster :). I would love to see practical
>> tutorials on doing such things and I am willing to put my attempts at
>> solving problems out there to eventually get cannibalized, ridiculed and
>> reimplemented properly :).
>>
>> Sorry for this long-winded email; it did not really answer your question
>> anyway :)
>> Ognen
>>
>>
>> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <pinu.datriciu@yahoo.com> wrote:
>>
>>> I have created a cluster setup with 2 workers (one of them is also the
>>> master).
>>>
>>> Can anyone point me to a tutorial on how to run, for example, K-Means on
>>> this cluster (preferably launched from a command line outside the
>>> cluster)?
>>>
>>> I am mostly interested in how to initialize the SparkContext (which jars
>>> do I need to pass to
>>> new SparkContext(master, appName, [sparkHome], [jars])?) and what other
>>> steps I need to run.
>>>
>>> I am using the standalone spark cluster.
>>>
>>> Thanks
>>>
>>>
>>>
>>
>
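
For the original question in this thread, here is a minimal sketch of a driver program that builds a SparkContext with the four-argument constructor quoted above and runs MLlib K-Means against a standalone cluster. It assumes the Spark 0.8/0.9-era MLlib API, where KMeans.train takes an RDD[Array[Double]]; Spark 1.0 and later expect an RDD[Vector] instead. The master URL, Spark home, jar path, object name, and the choice of k = 2 with 20 iterations are all hypothetical placeholders, not values from the thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

// Sketch only: assumes a Spark 0.8/0.9-era build with MLlib on the classpath.
object KMeansOnCluster {
  def main(args: Array[String]): Unit = {
    // Hypothetical values; replace with your own cluster settings.
    val master    = "spark://master-host:7077"        // standalone master URL
    val sparkHome = "/opt/spark"                      // Spark install path on the workers
    val jars      = Seq("target/kmeans-example.jar")  // jar containing this class

    val sc = new SparkContext(master, "KMeansOnCluster", sparkHome, jars)

    // args(0) is the input path. Each line is assumed to hold one point
    // as space-separated numbers, e.g. "0.1 0.2 0.3".
    val points = sc.textFile(args(0))
      .map(line => line.split(' ').map(_.toDouble))
      .cache() // K-Means is iterative, so keep the parsed points in memory

    val k = 2              // assumed number of clusters
    val maxIterations = 20 // assumed iteration cap
    val model = KMeans.train(points, k, maxIterations)

    model.clusterCenters.foreach(center => println(center.mkString(" ")))
    println("Within-set sum of squared errors: " + model.computeCost(points))

    sc.stop()
  }
}
```

The jars argument lists the code that Spark ships to the workers, so it would normally name the jar holding your own application classes. Whether MLlib itself has to be bundled into that jar depends on how Spark was built and deployed on the workers, so treat the single-jar list here as an assumption.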
