mahout-user mailing list archives

From "Don.Tan" <tanb...@gmail.com>
Subject Re: How to use kmeans clustering algorithm of Mahout
Date Thu, 13 Sep 2012 02:59:38 GMT
I tried it by following the sample code, and I noticed that I should 
not use seq2sparse directly: doing so leaves the sparse result empty. 
Could anyone help me deal with that?
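
For reference: as far as I know, seqdirectory writes one key/value record per input file, so a single large file such as /home/test/test/result ends up as one document rather than one document per user. The sketch below shows the per-line parsing that the data seems to call for. It assumes each line is one user, each comma-separated number is a feature ID, and a present feature gets weight 1.0. The function name `parse_profile` is hypothetical; this only illustrates the data layout and is not Mahout's API.

```python
def parse_profile(line):
    """Turn one comma-separated line of feature IDs into a sparse vector.

    Returns a dict mapping feature ID (int) to weight 1.0.
    Assumption: a feature that appears on the line simply counts as present.
    """
    vector = {}
    for token in line.split(","):
        token = token.strip()
        if token:  # skip the empty token left by a trailing comma
            vector[int(token)] = 1.0
    return vector


if __name__ == "__main__":
    # One user from the sample data quoted in the thread below.
    print(parse_profile("98660,158620,33900,"))
```

Each resulting dict corresponds to one sparse vector; writing such vectors into a SequenceFile that Mahout can read would still require Mahout's own Java classes.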

On 09/12/2012 07:09 PM, Paritosh Ranjan wrote:
> I think it shouldn't be sparse in the beginning; seq2sparse should 
> take care of it.
> Someone will correct me if I am wrong, so wait for some time 
> and then go ahead.
>
> On 12-09-2012 16:07, Don.Tan wrote:
>> Thank you for your prompt reply. Can I ask a question before I go on?
>>
>>      My original data is in a format like this:
>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>
>> which is already in a sparse format. Is it correct to use seqdirectory 
>> and seq2sparse directly?
>>
>>
>> On 09/12/2012 06:30 PM, Paritosh Ranjan wrote:
>>> Also try to follow the steps in the cluster-reuters.sh file. This 
>>> might help.
>>>
>>> On 12-09-2012 15:59, Paritosh Ranjan wrote:
>>>> Can you explain the error and provide the stack trace?
>>>>
>>>> On 12-09-2012 14:22, Don.Tan wrote:
>>>>> The original data is here:
>>>>>
>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/test
>>>>> Found 1 items
>>>>> -rw-r--r--   1 hadoop supergroup  129213799 2012-09-12 15:45 
>>>>> /home/test/test/result
>>>>>
>>>>> After I ran "mahout seqdirectory -i /home/test/test/ -o 
>>>>> /home/test/result/ -c UTF-8", I got this:
>>>>>
>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/result
>>>>> Found 1 items
>>>>> -rw-r--r--   1 hadoop supergroup  129213898 2012-09-12 15:47 
>>>>> /home/test/result/chunk-0
>>>>>
>>>>> And after "mahout seq2sparse -i /home/test/result -o 
>>>>> /home/test/sparse":
>>>>>
>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/sparse
>>>>> Found 7 items
>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
>>>>> /home/test/sparse/df-count
>>>>> -rw-r--r--   1 hadoop supergroup     442252 2012-09-12 15:53 
>>>>> /home/test/sparse/dictionary.file-0
>>>>> -rw-r--r--   1 hadoop supergroup     394853 2012-09-12 15:54 
>>>>> /home/test/sparse/frequency.file-0
>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>> /home/test/sparse/tf-vectors
>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
>>>>> /home/test/sparse/tfidf-vectors
>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>> /home/test/sparse/tokenized-documents
>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>> /home/test/sparse/wordcount
>>>>>
>>>>> What should I do next? I ran "mahout kmeans -i 
>>>>> /home/test/sparse/ -o /home/test/kmeans -dm 
>>>>> org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 
>>>>> 20 -ow --clustering"
>>>>> but I got an error.
>>>>>
>>>>> Thx!
>>>>>
>>>>>
>>>>> On 09/12/2012 03:24 PM, Paritosh Ranjan wrote:
>>>>>> I think you will need these two commands (in this order):
>>>>>>
>>>>>> seqdirectory : Generate sequence files (of Text) from a directory
>>>>>> seq2sparse: Sparse Vector generation from Text sequence files
>>>>>>
>>>>>> On 12-09-2012 12:28, Don Tan wrote:
>>>>>>> I think I didn't explain clearly enough, sorry for that.
>>>>>>>
>>>>>>> The example shown before is a part of my data.
>>>>>>>
>>>>>>> Each line is a user profile; for example, the first row holds the
>>>>>>> features of a user. I want to apply k-means to this data.
>>>>>>>
>>>>>>> I need to create a file that saves all user profiles as sparse
>>>>>>> vectors and feed them to the Mahout k-means algorithm. How can I do
>>>>>>> that?
>>>>>>>
>>>>>>>   Thanks for your advice!
>>>>>>>
>>>>>>> Don Tan
>>>>>>>
>>>>>>> 2012/9/12 Paritosh Ranjan <pranjan@xebia.com>
>>>>>>>
>>>>>>>> I could not understand the question correctly. Can you explain
>>>>>>>> more?
>>>>>>>> Here you can find how to use the k-means algorithm of Mahout:
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12-09-2012 11:43, Don.Tan wrote:
>>>>>>>>
>>>>>>>>> Aloha!
>>>>>>>>>
>>>>>>>>>     I am new to Hadoop and Mahout, but I have set up the
>>>>>>>>> Hadoop cluster.
>>>>>>>>>
>>>>>>>>>     I have been working on a clustering task lately. I don't think
>>>>>>>>> I can finish it quickly because I don't know much about how to deal
>>>>>>>>> with massive data (my data contains 1400000 users and 50000
>>>>>>>>> features, and it is sparse).
>>>>>>>>>
>>>>>>>>>     Could you tell me how to deal with that? A slice of the data is
>>>>>>>>> here:
>>>>>>>>>
>>>>>>>>> 167555,152622,162252,79481,66540,41942,75500,167898,
>>>>>>>>> 61923,182083,180681,181135,174449,166439,167307,174126,87800,2826,
>>>>>>>>>
>>>>>>>>>      98660,158620,33900,
>>>>>>>>> 4780,13922,45040,159210,26423,1471,68200,70402,109721,
>>>>>>>>> 145860,23740,5818,15087,47861,158620,170482,170161,39120,
>>>>>>>>> 164514,5854,169183,151229,171110,163457,4356,21363,1307,78105,1322,177011,167822,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>>>>>>>>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>>>>>>>>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>>>>>>>>
>>>>>>>>>      The example above contains 4 users' data, and each number is
>>>>>>>>> nominal (denoting a kind of user behavior; e.g., user 2 has
>>>>>>>>> "98660","158620","33900").
>>>>>>>>>
>>>>>>>>>      Please tell me how to work on that or which documents
>>>>>>>>> I should read.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Thx!
>>>>>>>>>
>>>>>>>>>     Don Tan
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
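
The kmeans command quoted in the thread selects CosineDistanceMeasure. As a rough illustration of what a cosine distance computes on sparse profiles, here is a small Python sketch. It assumes vectors are dicts from feature ID to weight. The function names are hypothetical, and the handling of empty vectors is an assumption of this sketch, not necessarily what Mahout does.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors (dicts)."""
    # Iterate over the smaller dict when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector."""
    return min(range(len(centroids)),
               key=lambda k: cosine_distance(vector, centroids[k]))
```

One k-means iteration would assign every user vector with `nearest_centroid` and then recompute each centroid as the mean of its members; with empty input vectors every distance is the same, so no meaningful clustering can result.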

