mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: How to use kmeans clustering algorithm of Mahout
Date Wed, 12 Sep 2012 10:29:17 GMT
Can you describe the error and provide the stack trace?

On 12-09-2012 14:22, Don.Tan wrote:
> The original data is here:
>
> [hadoop@datamining ~]$ hadoop fs -ls /home/test/test
> Found 1 items
> -rw-r--r--   1 hadoop supergroup  129213799 2012-09-12 15:45 
> /home/test/test/result
>
> After I ran "mahout seqdirectory -i /home/test/test/ -o 
> /home/test/result/ -c UTF-8", I got this:
>
> [hadoop@datamining ~]$ hadoop fs -ls /home/test/result
> Found 1 items
> -rw-r--r--   1 hadoop supergroup  129213898 2012-09-12 15:47 
> /home/test/result/chunk-0
>
> And after "mahout seq2sparse -i /home/test/result -o /home/test/sparse":
>
> [hadoop@datamining ~]$ hadoop fs -ls /home/test/sparse
> Found 7 items
> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
> /home/test/sparse/df-count
> -rw-r--r--   1 hadoop supergroup     442252 2012-09-12 15:53 
> /home/test/sparse/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup     394853 2012-09-12 15:54 
> /home/test/sparse/frequency.file-0
> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
> /home/test/sparse/tf-vectors
> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
> /home/test/sparse/tfidf-vectors
> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
> /home/test/sparse/tokenized-documents
> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
> /home/test/sparse/wordcount
>
> What should I do next? I ran "mahout kmeans -i /home/test/sparse/ -o 
> /home/test/kmeans -dm 
> org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 
> -ow --clustering"
> but I got an error.
>
> Thx!
>
>
> On 09/12/2012 03:24 PM, Paritosh Ranjan wrote:
>> I think you will need these two commands (in this order):
>>
>> seqdirectory : Generate sequence files (of Text) from a directory
>> seq2sparse: Sparse Vector generation from Text sequence files
>>
>> On 12-09-2012 12:28, Don Tan wrote:
>>> I don't think I explained this clearly enough, sorry about that.
>>>
>>> The example shown before is part of my data.
>>>
>>> Each line is a user profile; for example, the first row holds the
>>> features of one user. I want to apply k-means to this data.
>>>
>>> I need to create a file that stores all user profiles as sparse
>>> vectors and feed it to Mahout's k-means algorithm. How can I do that?
>>>
>>>   Thanks for your advice!
>>>
>>> Don Tan
>>>
>>> 2012/9/12 Paritosh Ranjan <pranjan@xebia.com>
>>>
>>>> I could not understand the question correctly, can you explain more?
>>>> Here you can find how to use kmeans algorithm of Mahout
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>>>>
>>>>
>>>>
>>>> On 12-09-2012 11:43, Don.Tan wrote:
>>>>
>>>>> Aloha!
>>>>>
>>>>>     I am new to Hadoop and Mahout, but I have set up the Hadoop 
>>>>> cluster.
>>>>>
>>>>>     I have been working on a clustering task lately. I don't think 
>>>>> I can finish it quickly because I don't know much about handling 
>>>>> massive data (my data contains 1400000 users and 50000 features, 
>>>>> and it is sparse).
>>>>>
>>>>>     Could you tell me how to deal with that? A slice of the data is here:
>>>>>
>>>>> 167555,152622,162252,79481,66540,41942,75500,167898,
>>>>> 61923,182083,180681,181135,174449,166439,167307,174126,87800,2826,
>>>>>
>>>>> 98660,158620,33900,
>>>>>
>>>>> 4780,13922,45040,159210,26423,1471,68200,70402,109721,
>>>>> 145860,23740,5818,15087,47861,158620,170482,170161,39120,
>>>>> 164514,5854,169183,151229,171110,163457,4356,21363,1307,78105,1322,177011,167822,
>>>>>
>>>>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>>>>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>>>>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>>>>
>>>>>      The example above contains 4 users' data, and each number is 
>>>>> nominal (it denotes a kind of user behavior; e.g., user 2 has
>>>>> "98660","158620","33900").
>>>>>
>>>>>      Please tell me how to work with this, or which documents I 
>>>>> should read.
>>>>>
>>>>>
>>>>>      Thx!
>>>>>
>>>>>     Don Tan
>>>>>
>>>>
>>>>
>>
>>
>
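
The data described in the thread (one user per line, each number a nominal feature ID) can be read as one sparse binary vector per line. The following is a minimal plain-Python sketch of that idea, not Mahout code: the function names `parse_profile` and `cosine_distance` are hypothetical, the weight 1.0 per feature is an assumption, and returning 1.0 for an empty vector is a convention chosen here for illustration. The distance is the usual "1 minus cosine similarity", the same kind of measure that the CosineDistanceMeasure option in the kmeans command refers to.

```python
import math


def parse_profile(line):
    """Parse one comma-separated line of feature IDs into a sparse vector.

    The vector is a dict {feature_id: 1.0}; features that are absent are
    simply not stored. Empty tokens (e.g. from a trailing comma) are skipped.
    """
    tokens = [tok.strip() for tok in line.split(",")]
    return {int(tok): 1.0 for tok in tokens if tok}


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    if not a or not b:
        # Assumption for this sketch: an empty vector is maximally distant.
        return 1.0
    # Iterate over the smaller vector so the dot product only touches
    # features that could be shared.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot = sum(value * large.get(key, 0.0) for key, value in small.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # User 2 from the sample in the thread.
    user2 = parse_profile("98660,158620,33900,")
    # A made-up profile that shares one feature with user 2.
    other = parse_profile("158620,170482,170161")
    print(sorted(user2))
    print(cosine_distance(user2, other))
```

Storing only the feature IDs that are present keeps memory proportional to the number of non-zero entries, which matters when there are about 50000 possible features but each user has only a few dozen.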


