mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 18:38:17 GMT
I think the bigger issue here is we are doing extra work to calculate  
distance.  I'd suggest hanging on a few days to see if we can get that  
straightened out.

On Jul 27, 2009, at 2:33 PM, nfantone wrote:

>> Well, it does matter to some degree since picking random vectors  
>> tends to give you dense vectors whereas text gives you very sparse  
>> vectors.
>> Different patterns of sparsity can cause radically different time
>> complexity for the clustering.
> I have yet to find a random combination of vectors that actually
> improves the performance of kMeans substantially. I have also tried
> real datasets (like the one I was initially using, built from large
> amounts of data describing consumers' buying habits) to no avail. How
> should a collection of vectors be created so that it does not, say,
> significantly compromise the algorithm's functionality?
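
To illustrate the sparsity point quoted above, here is a minimal sketch of
a distance computation whose cost depends on the number of non-zero entries
rather than on the nominal dimension. It is not Mahout's code: the class
name SparseDistance, the method squaredDistance, and the choice of an
index-to-value map as the vector representation are all assumptions made
for the example.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
final class SparseDistance {

    // Squared Euclidean distance between two sparse vectors, each stored
    // as a map from dimension index to non-zero value (an assumed layout).
    // The work is proportional to the number of stored entries, not to
    // the full dimensionality of the space.
    static double squaredDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        // Dimensions that are non-zero in a (whether or not they are in b).
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            double diff = e.getValue() - (other == null ? 0.0 : other);
            sum += diff * diff;
        }
        // Dimensions that are non-zero only in b.
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return sum;
    }
}
```

With text-like vectors that hold a handful of non-zero terms, each call
touches only those few entries. With randomly generated dense vectors,
nearly every dimension is stored, so the same loop visits the whole
dimension for every point-to-centroid comparison.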

Grant Ingersoll
