mahout-user mailing list archives

From Jeff Eastman <>
Subject RE: AW: Incremental clustering
Date Thu, 12 May 2011 16:40:14 GMT
Each iteration of the kmeans, fuzzyK & Dirichlet clustering algorithms begins with an initial
(prior) set of clusters (a.k.a. models). Each iteration assigns each input vector to one cluster
(kmeans = most likely; Dirichlet = multinomial sampling) or to multiple clusters (fuzzyK = a
percentage of each). Then, at the end of the iteration, each cluster's parameters are recomputed
based upon the observed data, and the posterior clusters from iteration n become the prior
clusters for iteration n+1.
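The prior/posterior hand-off described above can be sketched in plain Java. This is a toy illustration only; the class and method names are hypothetical and deliberately avoid Mahout's actual API:

```java
// Toy sketch of one kmeans-style iteration: the prior centroids from
// iteration n are refined into the posterior centroids for iteration n+1.
public class KMeansIterationSketch {

    // Assign each point to its nearest prior centroid, then recompute each
    // centroid as the mean of its assigned points (the posterior clusters).
    public static double[][] iterate(double[][] points, double[][] priors) {
        int k = priors.length, dim = priors[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = nearest(p, priors);
            counts[best]++;
            for (int d = 0; d < dim; d++) sums[best][d] += p[d];
        }
        double[][] posteriors = new double[k][dim];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) { posteriors[c] = priors[c].clone(); continue; }
            for (int d = 0; d < dim; d++) posteriors[c][d] = sums[c][d] / counts[c];
        }
        return posteriors;
    }

    // Index of the centroid with the smallest squared Euclidean distance.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

Calling iterate repeatedly, feeding each result back in as the next prior, is exactly the n → n+1 loop described above; fuzzyK and Dirichlet differ only in how the assignment step is made.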

Based upon discussions with Ted, I've been trying to recast clustering as an unsupervised
classification problem. This is most obvious if you look at the new ClusterClassifier &
ClusterIterator, which implement all three algorithms in a single classification-ready engine.
ClusterClassifier extends AbstractVectorClassifier and implements OnlineLearner. This means
a ClusterClassifier produced by unsupervised training with some data can be used as a model
in a semi-supervised classifier along with models obtained via supervised training.
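A minimal sketch of that idea, assuming only the behavior described above. The interface and class names here are hypothetical stand-ins, not Mahout's actual AbstractVectorClassifier/OnlineLearner API:

```java
// Stand-in for the classifier abstraction: anything that can label a point.
interface PointClassifier {
    int classify(double[] point);
}

// A cluster model trained without labels that nonetheless answers
// classify() like a supervised model: "classification" is assignment
// to the nearest cluster. Such a model can sit alongside supervised
// models behind the same interface.
class ClusterModelSketch implements PointClassifier {
    private final double[][] centroids;

    ClusterModelSketch(double[][] centroids) {
        this.centroids = centroids;
    }

    @Override
    public int classify(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

The point of the shared interface is the semi-supervised combination: code downstream never needs to know whether the centroids came from unsupervised training or from labeled data.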

I've adjusted the 3 Display clustering examples to use the ClusterClassifier so you can see
that it works pretty well. I'm particularly pleased with how Dirichlet and Kmeans fit together
using this approach.

-----Original Message-----
From: Benson Margulies [] 
Sent: Thursday, May 12, 2011 9:14 AM
Subject: Re: AW: Incremental clustering


Could you expand a bit on the subject of models in clustering? I
mentally simplify this into 'clustering: unsupervised; classification:
supervised'.

Is the idea here that you are going to be presented with many
different corpora that have some sort of overall resemblance, so that
priors derived from the first N corpora speed up clustering corpus N+1?


On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <> wrote:
> Sure, by using your old clusters as the prior (clustersIn) for the new clustering, you
can reduce the number of iterations required to converge.
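A toy, self-contained illustration of why seeding with old clusters helps (hypothetical names; not Mahout code): centroids that start close to the final answer converge in fewer Lloyd-style passes than an arbitrary seed.

```java
// Counts how many iterations a kmeans-style loop needs from a given seed.
public class WarmStartSketch {

    // Iterate until no centroid moves more than eps; return the pass count.
    public static int iterationsToConverge(double[][] pts, double[][] seed, double eps) {
        double[][] cur = seed;
        for (int iter = 1; ; iter++) {
            double[][] next = step(pts, cur);
            if (maxShift(cur, next) < eps) return iter;
            cur = next;
        }
    }

    // One pass: assign each point to its nearest centroid, then re-average.
    static double[][] step(double[][] pts, double[][] cents) {
        int k = cents.length, dim = cents[0].length;
        double[][] sums = new double[k][dim];
        int[] n = new int[k];
        for (double[] p : pts) {
            int best = 0;
            double bd = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d2 = 0;
                for (int d = 0; d < dim; d++) { double t = p[d] - cents[c][d]; d2 += t * t; }
                if (d2 < bd) { bd = d2; best = c; }
            }
            n[best]++;
            for (int d = 0; d < dim; d++) sums[best][d] += p[d];
        }
        double[][] out = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dim; d++)
                out[c][d] = n[c] == 0 ? cents[c][d] : sums[c][d] / n[c];
        return out;
    }

    // Largest per-coordinate centroid movement between two passes.
    static double maxShift(double[][] a, double[][] b) {
        double m = 0;
        for (int c = 0; c < a.length; c++)
            for (int d = 0; d < a[c].length; d++)
                m = Math.max(m, Math.abs(a[c][d] - b[c][d]));
        return m;
    }
}
```

When the new crawl's data resembles the old, last run's clusters are already near the fixed point, so the warm-started run needs fewer passes than a cold start.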
> -----Original Message-----
> From: David Saile []
> Sent: Thursday, May 12, 2011 8:54 AM
> To:
> Subject: Re: AW: Incremental clustering
> Thank you very much everyone! This really helped a lot.
> Here is what I am planning to do:
> I am going to compute an initial clustering after the first crawl.
> Then, as sites are being added to the index I will simply classify them using the existing
> clusters.
> As I expect updates to be generally very small, I will only recompute the clustering
> after some threshold has been hit, like Grant suggested.
> As Ted pointed out, this can be done with the old clusters as input.
> Thanks again,
> David
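The plan above can be sketched roughly as follows. All names are hypothetical and the classify/recluster bodies are stubbed out; only the threshold-triggered control flow is the point:

```java
// Classify each new document as it arrives; only re-run full clustering
// once enough new documents have accumulated.
public class IncrementalIndexSketch {
    private final int threshold;
    private int pendingDocs = 0;
    private int reclusterRuns = 0;

    IncrementalIndexSketch(int threshold) {
        this.threshold = threshold;
    }

    // Returns true when this addition triggered a re-clustering pass.
    boolean addDocument() {
        pendingDocs++;   // in reality: classify the doc into the nearest existing cluster
        if (pendingDocs >= threshold) {
            recluster();
            return true;
        }
        return false;
    }

    private void recluster() {
        reclusterRuns++; // in reality: re-run clustering seeded with the old clusters
        pendingDocs = 0;
    }

    int reclusterRuns() {
        return reclusterRuns;
    }
}
```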
> On 12.05.2011 at 17:35, Ted Dunning wrote:
>> Most of these algorithms can be done in an incremental fashion in which you
>> can add batches to the previous training.
>> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <> wrote:
>>> Most of the clustering drivers have two methods: one to train the clusterer
>>> with data to produce the cluster models; one to classify the data using a
>>> given set of cluster models. Currently the CLI only allows train followed by
>>> optional classify. We could pretty easily allow classify to be done
>>> stand-alone, and this would be useful in support of Grant's approach below.
>>> Jeff
>>> -----Original Message-----
>>> From: Grant Ingersoll []
>>> Sent: Thursday, May 12, 2011 3:32 AM
>>> To:
>>> Subject: Re: AW: Incremental clustering
>>> From what I've seen, using Mahout's existing clustering methods, I think
>>> most people set up some schedule whereby they cluster the whole collection on
>>> a regular basis and then all docs that come in the meantime are simply
>>> assigned to the closest cluster until the next whole collection iteration is
>>> completed.  There are, of course, other variants one could do, such as kick
>>> off the whole clustering when some threshold of number of docs is reached.
>>> There are other clustering methods, as Benson alluded to, that may better
>>> support incremental approaches.
>>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>>>> I am still stuck at this problem.
>>>> Can anyone give me a heads-up on how existing systems handle this?
>>>> If a collection of documents is modified, is the clustering recomputed
>>> from scratch each time?
>>>> Or is there in fact any incremental way to handle an evolving set of
>>> documents?
>>>> I would really appreciate any hint!
>>>> Thanks,
>>>> David
>>>> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>>>>> Not an answer, but a follow-up question:
>>>>> I would be interested in the very same thing, but with the possibility
>>> to assign new sites to existing clusters OR to new ones.
>>>>> Thanks in advance,
>>>>> Ulrich
>>>>> -----Original Message-----
>>>>> From: David Saile []
>>>>> Sent: Monday, May 9, 2011 11:53
>>>>> To:
>>>>> Subject: Incremental clustering
>>>>> Hi list,
>>>>> I am completely new to Mahout, so please forgive me if the answer to my
>>> question is too obvious.
>>>>> For a case study, I am working on a simple incremental web crawler (much
>>> like Nutch) and I want to include a very simple indexing step that
>>> incorporates clustering of documents.
>>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>> order to make use of the incremental way the crawler is supposed to work
>>> (i.e. continuously adding and updating websites).
>>>>> Is there some way to achieve the following:
>>>>>     1) initial clustering of the first web-crawl
>>>>>     2) assigning new sites to existing clusters
>>>>>     3) possibly moving modified sites between clusters
>>>>> I would really appreciate any help!
>>>>> Thanks,
>>>>> David
>>> --------------------------
>>> Grant Ingersoll
>>> Search the Lucene ecosystem docs using Solr/Lucene: