mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: How to use kmeans clustering algorithm of Mahout
Date Wed, 12 Sep 2012 07:24:02 GMT
I think you will need these two commands (in this order):

seqdirectory : Generate sequence files (of Text) from a directory
seq2sparse: Sparse Vector generation from Text sequence files
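
For intuition only, here is a minimal Python sketch of what "turning profiles into sparse vectors" means for comma-separated feature IDs like the ones in the quoted sample. It is not Mahout's implementation and does not write Mahout's sequence-file format. The function names (parse_profile, build_dictionary, to_sparse) are hypothetical, and it assumes one user per line with binary (present/absent) features. The first sample line is user 2's data from the thread; the second is a shortened, illustrative line.

```python
from typing import Dict, List, Tuple


def parse_profile(line: str) -> List[str]:
    """Split one comma-separated profile line into feature IDs, dropping empties."""
    return [tok.strip() for tok in line.split(",") if tok.strip()]


def build_dictionary(profiles: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct feature ID a column index in first-seen order."""
    index: Dict[str, int] = {}
    for features in profiles:
        for feature in features:
            if feature not in index:
                index[feature] = len(index)
    return index


def to_sparse(features: List[str], index: Dict[str, int]) -> List[Tuple[int, float]]:
    """Binary sparse vector: one (column, 1.0) pair per feature present, sorted by column."""
    columns = sorted({index[f] for f in features})
    return [(c, 1.0) for c in columns]


# Assumption: each input line is one user profile (trailing commas are tolerated).
lines = [
    "98660,158620,33900,",
    "4780,13922,158620,",
]
profiles = [parse_profile(line) for line in lines]
index = build_dictionary(profiles)
vectors = [to_sparse(p, index) for p in profiles]
```

Only the non-zero columns are stored per user, which is what keeps 1400000 users by 50000 features manageable when the data is sparse.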

On 12-09-2012 12:28, Don Tan wrote:
> I think I didn't explain clearly enough, and I am sorry for that.
>
> The example shown before is part of my data.
>
> Each line is a user profile; for example, the first row holds the features
> of one user. I want to apply k-means to this data.
>
> I need to create a file that stores every user profile as a sparse vector
> and feed it to the Mahout k-means algorithm. How can I do that?
>
>   Thanks for your advice!
>
> Don Tan
>
> 2012/9/12 Paritosh Ranjan <pranjan@xebia.com>
>
>> I could not understand the question correctly. Can you explain more?
>> Here you can find how to use the k-means algorithm of Mahout:
>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>>
>>
>> On 12-09-2012 11:43, Don.Tan wrote:
>>
>>> Aloha!
>>>
>>>     I am new to Hadoop and Mahout, but I have set up the Hadoop cluster.
>>>
>>>     I have been working on a clustering task lately. I am making slow
>>> progress because I don't know much about handling massive data (my data
>>> contains 1400000 users and 50000 features, and it is sparse).
>>>
>>>     Could you tell me how to deal with that? A slice of the data is here:
>>>
>>> 167555,152622,162252,79481,66540,41942,75500,167898,
>>> 61923,182083,180681,181135,174449,166439,167307,174126,87800,2826,
>>>      98660,158620,33900,
>>> 4780,13922,45040,159210,26423,1471,68200,70402,109721,
>>> 145860,23740,5818,15087,47861,158620,170482,170161,39120,
>>> 164514,5854,169183,151229,171110,163457,4356,21363,1307,78105,1322,177011,167822,
>>>
>>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>>
>>>      The example above contains 4 users' data, and each number is nominal
>>> (it denotes a kind of user behavior; e.g., user 2 has
>>> "98660","158620","33900").
>>>
>>>      Please tell me how to work on that, or which documents I should read.
>>>
>>>
>>>      Thx!
>>>
>>>     Don Tan
>>>
>>
>>


