mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Date Wed, 29 Mar 2017 16:10:57 GMT
Sorry, I think that more commonly, if aggregating transpose is to be used, the
centroid assignments should be the key of the matrix D (so D := A), and the
aggregating transpose is performed on the matrix (1 | D)' (i.e., (1 cbind D).t),
so that the first row of the result contains the counts of cluster points and
we can finish the centroid recomputation via

M = (1 | D)'
C = M(2:,:) with each row Hadamard-divided by the first row of counts M(1,:)
(using Golub-Van Loan notation for subblocks; the centroids are then the
columns of C)
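
The aggregation above can be sketched in-core as follows. This is only an
illustration in plain Python lists, not the Samsara API: the function name
aggregating_transpose_centroids and the toy data are made up here. Rows of A
are keyed by their cluster assignment, a column of ones is prepended, rows
sharing a key are summed into the same column of M, and the rows below the
first are divided element-wise by the first row of counts.

```python
def aggregating_transpose_centroids(keys, rows, k):
    """Emulate M = (1 | A)' with rows aggregated by key, then divide by counts.

    keys[i] is the 0-based cluster index assigned to rows[i].
    Returns (counts, centroids), where centroids[j] is the mean of cluster j.
    Assumes every cluster 0..k-1 has at least one point.
    """
    d = len(rows[0])
    # M has d + 1 rows and k columns; row 0 accumulates the counts.
    m = [[0.0] * k for _ in range(d + 1)]
    for key, row in zip(keys, rows):
        augmented = [1.0] + list(row)      # (1 | a_i)
        for r, value in enumerate(augmented):
            m[r][key] += value             # rows with equal keys are summed
    counts = m[0]
    # Divide each remaining row element-wise by the counts row.
    c = [[m[r][j] / counts[j] for j in range(k)] for r in range(1, d + 1)]
    # Centroids are the columns of c; return them as rows for convenience.
    centroids = [[c[r][j] for r in range(d)] for j in range(k)]
    return counts, centroids


keys = [0, 1, 0, 1]
rows = [[0.0, 0.0], [10.0, 10.0], [2.0, 4.0], [12.0, 14.0]]
counts, centroids = aggregating_transpose_centroids(keys, rows, k=2)
print(counts)      # [2.0, 2.0]
print(centroids)   # [[1.0, 2.0], [11.0, 12.0]]
```

In a distributed setting the same sums would be produced by the aggregating
transpose itself; the explicit loop here only stands in for that step.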

On Wed, Mar 29, 2017 at 9:02 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> The simplest scheme is to initialize a distributed matrix of the shape D :=
> (0 | A), where A is your dataset and 0 is a single column holding the current
> centroid assignment, and to distribute the current centroid matrix C via
> matrix broadcast (assuming there are few enough centers).
>
> Then alternate between running cluster assignment within the mapBlock()
> operator on D and recomputing the new centroids C afterwards. Recomputation
> of centroids can be done via aggregating transpose.
>
> Of course, a better scheme includes pre-sketching (k-means||) and use of the
> triangle inequality during recomputations.
>
> On Wed, Mar 29, 2017 at 8:30 AM, KHATWANI PARTH BHARAT <
> h2016170@pilani.bits-pilani.ac.in> wrote:
>
>> Sir,
>> I am trying to write the k-means clustering algorithm using Mahout Samsara,
>> but I am a bit confused about how to leverage the Distributed Row Matrix for
>> it. Can anybody help me with this?
>>
>> Thanks
>> Parth Khatwani
>>
>
>
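
The alternating scheme from the quoted message (assign each point to its
nearest centroid, then recompute the centroids) can be sketched in-core as
follows. This is a plain Python illustration under stated assumptions, not
Samsara code: the names assign, recompute and kmeans are hypothetical, the
distance is squared Euclidean, and neither pre-sketching nor the triangle
inequality is used. In the distributed version the assignment step would run
inside mapBlock() against the broadcast matrix C.

```python
def assign(rows, centroids):
    """Return, for each row, the index of the nearest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda j: dist2(row, centroids[j]))
            for row in rows]


def recompute(rows, keys, centroids):
    """Recompute each centroid as the mean of its assigned rows.

    An empty cluster keeps its previous centroid (a choice made for this sketch).
    """
    d = len(rows[0])
    new = []
    for j, old in enumerate(centroids):
        members = [row for row, key in zip(rows, keys) if key == j]
        if members:
            new.append([sum(row[i] for row in members) / len(members)
                        for i in range(d)])
        else:
            new.append(list(old))
    return new


def kmeans(rows, centroids, max_iter=20):
    """Alternate assignment and recomputation until assignments stop changing."""
    keys = assign(rows, centroids)
    for _ in range(max_iter):
        centroids = recompute(rows, keys, centroids)
        new_keys = assign(rows, centroids)
        if new_keys == keys:   # assignments are stable, so stop
            break
        keys = new_keys
    return keys, centroids


rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
keys, centroids = kmeans(rows, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(keys)        # [0, 0, 1, 1]
print(centroids)   # [[0.0, 1.0], [10.0, 11.0]]
```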
