mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: AW: Incremental clustering
Date Thu, 12 May 2011 16:40:14 GMT
Sure,
Each iteration of the kmeans, fuzzyK & Dirichlet clustering algorithms begins with an initial
(prior) set of clusters (a.k.a. models). Each iteration assigns each input vector to one cluster
(kmeans = most likely; Dirichlet = multinomial sampling) or to multiple clusters (fuzzyK =
a percentage of each). Then, at the end of the iteration, each cluster's parameters are
recomputed based upon the observed data, and the posterior clusters from iteration n become
the prior clusters for iteration n+1.
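
To make that loop concrete, here is a minimal sketch in plain Java, using 1-D k-means with hard assignment; the class and method names are illustrative only, not the actual Mahout API:

```java
import java.util.Arrays;

// Toy illustration of the loop described above: the posterior clusters
// computed in iteration n become the prior clusters for iteration n+1.
// Plain 1-D k-means with hard ("most likely") assignment.
public class KMeansSketch {

    // One iteration: assign each point to its nearest prior center,
    // then recompute each center from the points it observed.
    static double[] iterate(double[] points, double[] priors) {
        double[] sum = new double[priors.length];
        int[] count = new int[priors.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < priors.length; c++) {
                if (Math.abs(p - priors[c]) < Math.abs(p - priors[best])) {
                    best = c;
                }
            }
            sum[best] += p;
            count[best]++;
        }
        double[] posteriors = new double[priors.length];
        for (int c = 0; c < priors.length; c++) {
            // An empty cluster simply keeps its prior center.
            posteriors[c] = (count[c] == 0) ? priors[c] : sum[c] / count[c];
        }
        return posteriors;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] clusters = {0.0, 5.0};           // initial prior models
        for (int n = 0; n < 10; n++) {
            clusters = iterate(points, clusters); // posterior -> next prior
        }
        System.out.println(Arrays.toString(clusters));
    }
}
```

FuzzyK and Dirichlet differ only in how the assignment step distributes each vector across the models; the prior/posterior handoff is the same.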

Based upon discussions with Ted, I've been trying to recast clustering as an unsupervised
classification problem. This is most obvious if you look at the new ClusterClassifier &
ClusterIterator, which implement all three algorithms in a single classification-ready engine.
ClusterClassifier extends AbstractVectorClassifier and implements OnlineLearner. This means
a ClusterClassifier produced by unsupervised training with some data can be used as a model
in a semi-supervised classifier along with models obtained via supervised training.
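
The idea, boiled down to a toy (these names are illustrative; they are not Mahout's actual AbstractVectorClassifier/OnlineLearner signatures): a cluster model is trained without labels, and afterwards answers classification queries where the "class" is a cluster id.

```java
// Illustrative only: a cluster model trained without labels that can then
// act as a classifier, returning a cluster id as the "class" of a point.
public class ClusterAsClassifier {

    private final double[] centers;

    public ClusterAsClassifier(double[] initialCenters) {
        this.centers = initialCenters.clone();
    }

    // Online, unsupervised training: nudge the nearest center toward each
    // observed point (a streaming-k-means-style update; no labels needed).
    public void train(double point, double learningRate) {
        int c = classify(point);
        centers[c] += learningRate * (point - centers[c]);
    }

    // Classification: index of the most likely (nearest) cluster.
    public int classify(double point) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (Math.abs(point - centers[c]) < Math.abs(point - centers[best])) {
                best = c;
            }
        }
        return best;
    }
}
```

Because training and classification share one object, the trained model can be dropped into the same pipeline slot as a supervised classifier.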

I've adjusted the 3 Display clustering examples to use the ClusterClassifier so you can see
that it works pretty well. I'm particularly pleased with how Dirichlet and Kmeans fit together
using this approach.
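
The incremental angle works the same way: seed a new clustering run with the previous run's clusters as the prior, and it typically converges in fewer iterations than a cold start. A toy demonstration (again plain 1-D Java, not the Mahout drivers or CLI):

```java
import java.util.Arrays;

// Toy demo of warm-starting: re-clustering a slightly grown data set
// converges in fewer iterations when seeded with the previous run's
// clusters than with a poor fresh guess.
public class WarmStartSketch {

    // One k-means iteration: nearest-center assignment, then recompute means.
    static double[] step(double[] points, double[] centers) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
                    best = c;
                }
            }
            sum[best] += p;
            count[best]++;
        }
        double[] next = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
            next[c] = (count[c] == 0) ? centers[c] : sum[c] / count[c];
        }
        return next;
    }

    // Iterate until the centers stop moving (exact equality is fine for
    // this deterministic toy data); return the iteration count.
    static int iterationsToConverge(double[] points, double[] centers) {
        for (int n = 1; ; n++) {
            double[] next = step(points, centers);
            if (Arrays.equals(next, centers)) {
                return n;
            }
            centers = next;
        }
    }

    public static void main(String[] args) {
        double[] grown = {1.0, 1.5, 0.5, 9.0, 9.5, 8.5, 1.1, 9.2}; // old + new docs
        double[] previous = {1.0, 9.0}; // converged clusters from the last run
        int cold = iterationsToConverge(grown, new double[]{0.0, 0.6});
        int warm = iterationsToConverge(grown, previous);
        System.out.println("cold=" + cold + ", warm=" + warm);
    }
}
```

With real corpora the gap can be much larger, since a good prior also avoids the bad local minima a random seeding can fall into.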

-----Original Message-----
From: Benson Margulies [mailto:bimargulies@gmail.com] 
Sent: Thursday, May 12, 2011 9:14 AM
To: user@mahout.apache.org
Subject: Re: AW: Incremental clustering

Jeff,

Could you expand a bit on the subject of models in clustering? I
mentally simplify this into 'clustering: unsupervised; classification:
supervised.'

Is the idea here that you are going to be presented with many
different corpora that have some sort of overall resemblance, so that
priors derived from the first N speed up clustering N+1?

--benson


On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <jeastman@narus.com> wrote:
> Sure, by using your old clusters as the prior (clustersIn) for the new clustering, you
> can reduce the number of iterations required to converge.
>
> -----Original Message-----
> From: David Saile [mailto:david@uni-koblenz.de]
> Sent: Thursday, May 12, 2011 8:54 AM
> To: user@mahout.apache.org
> Subject: Re: AW: Incremental clustering
>
> Thank you very much everyone! This really helped a lot.
>
> Here is what I am planning to do:
> I am going to compute an initial clustering after the first crawl.
> Then, as sites are being added to the index I will simply classify them using the existing
> clusters.
>
> As I expect updates to be generally very small, I will only recompute the clustering
> after some threshold has been hit, like Grant suggested.
> As Ted pointed out, this can be done with the old clusters as input.
>
> Thanks again,
> David
>
>
>
On 12.05.2011 at 17:35, Ted Dunning wrote:
>
>> Most of these algorithms can be done in an incremental fashion in which you
>> can add batches to the previous training.
>>
>> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <jeastman@narus.com> wrote:
>>
>>> Most of the clustering drivers have two methods: one to train the clusterer
>>> with data to produce the cluster models; one to classify the data using a
>>> given set of cluster models. Currently the CLI only allows train followed by
>>> optional classify. We could pretty easily allow classify to be done
>>> stand-alone, and this would be useful in support of Grant's approach below.
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>> Sent: Thursday, May 12, 2011 3:32 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: AW: Incremental clustering
>>>
>>> From what I've seen, using Mahout's existing clustering methods, I think
>>> most people set up some schedule whereby they cluster the whole collection on
>>> a regular basis and then all docs that come in the meantime are simply
>>> assigned to the closest cluster until the next whole collection iteration is
>>> completed.  There are, of course, other variants one could do, such as kick
>>> off the whole clustering when some threshold of number of docs is reached.
>>>
>>> There are other clustering methods, as Benson alluded to, that may better
>>> support incremental approaches.
>>>
>>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>>>
>>>> I am still stuck at this problem.
>>>>
>>>> Can anyone give me a heads-up on how existing systems handle this?
>>>> If a collection of documents is modified, is the clustering recomputed
>>>> from scratch each time?
>>>> Or is there in fact any incremental way to handle an evolving set of
>>>> documents?
>>>>
>>>> I would really appreciate any hint!
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>>>>
>>>>> Not an answer, but a follow-up question:
>>>>> I would be interested in the very same thing, but with the possibility
>>>>> to assign new sites to existing clusters OR to new ones.
>>>>>
>>>>> Thanks in advance,
>>>>> Ulrich
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Saile [mailto:david@uni-koblenz.de]
>>>>> Sent: Monday, 9 May 2011 11:53
>>>>> To: user@mahout.apache.org
>>>>> Subject: Incremental clustering
>>>>>
>>>>> Hi list,
>>>>>
>>>>> I am completely new to Mahout, so please forgive me if the answer to my
>>>>> question is too obvious.
>>>>>
>>>>> For a case study, I am working on a simple incremental web crawler (much
>>>>> like Nutch) and I want to include a very simple indexing step that
>>>>> incorporates clustering of documents.
>>>>>
>>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>>>> order to make use of the incremental way the crawler is supposed to work
>>>>> (i.e. continuously adding and updating websites).
>>>>>
>>>>> Is there some way to achieve the following:
>>>>>     1) initial clustering of the first web-crawl
>>>>>     2) assigning new sites to existing clusters
>>>>>     3) possibly moving modified sites between clusters
>>>>>
>>>>> I would really appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> David
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
>