mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: AW: Incremental clustering
Date Thu, 12 May 2011 10:32:06 GMT
From what I've seen, using Mahout's existing clustering methods, I think most people set up
a schedule whereby they cluster the whole collection on a regular basis, and any docs
that come in in the meantime are simply assigned to the closest existing cluster until the next
whole-collection run is completed.  There are, of course, other variants one could do, such
as kicking off the whole clustering when the number of new docs reaches some threshold.
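
The scheme above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: the names `cosine_distance`, `assign_to_closest`, and `IncrementalAssigner`, the choice of cosine distance, and the `recluster_threshold` parameter are all assumptions made for the example. The centroids are assumed to come from the most recent full clustering run.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_closest(doc_vector, centroids):
    """Return the index of the centroid nearest to doc_vector."""
    return min(range(len(centroids)),
               key=lambda i: cosine_distance(doc_vector, centroids[i]))


class IncrementalAssigner:
    """Hypothetical helper: assigns new docs to the existing clusters and
    reports when enough docs have arrived to justify a full re-clustering."""

    def __init__(self, centroids, recluster_threshold):
        self.centroids = centroids
        self.recluster_threshold = recluster_threshold
        self.pending = 0  # docs seen since the last full clustering

    def add(self, doc_vector):
        """Return (cluster_id, recluster_due) for one incoming document."""
        cluster_id = assign_to_closest(doc_vector, self.centroids)
        self.pending += 1
        return cluster_id, self.pending >= self.recluster_threshold

    def reset(self, new_centroids):
        """Call after a full re-clustering has produced fresh centroids."""
        self.centroids = new_centroids
        self.pending = 0
```

When `add` reports that a re-clustering is due, the caller would run the full batch clustering job over the whole collection and pass the resulting centroids to `reset`. A purely time-based schedule works the same way, with a timer in place of the counter.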

There are other clustering methods, as Benson alluded to, that may better support incremental
approaches.

On May 12, 2011, at 4:53 AM, David Saile wrote:

> I am still stuck at this problem.
> 
> Can anyone give me a heads-up on how existing systems handle this? 
> If a collection of documents is modified, is the clustering recomputed from scratch each time?
> Or is there in fact any incremental way to handle an evolving set of documents?
> 
> I would really appreciate any hint!
> 
> Thanks,
> David
> 
> 
> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> 
>> Not an answer, but a follow-up question: 
>> I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.
>> 
>> Thanks in advance,
>> Ulrich
>> 
>> -----Original Message-----
>> From: David Saile [mailto:david@uni-koblenz.de]
>> Sent: Monday, 9 May 2011 11:53
>> To: user@mahout.apache.org
>> Subject: Incremental clustering
>> 
>> Hi list,
>> 
>> I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.
>> 
>> For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.
>> 
>> I was hoping to use some kind of incremental clustering algorithm, in order to make use of the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).
>> 
>> Is there some way to achieve the following:
>> 	1) initial clustering of the first web-crawl
>> 	2) assigning new sites to existing clusters
>> 	3) possibly moving modified sites between clusters
>> 
>> I would really appreciate any help!
>> 
>> Thanks,
>> David
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

