mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Thu, 23 Jul 2009 16:50:33 GMT

On Jul 23, 2009, at 10:20 AM, nfantone wrote:

>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>> Perhaps a larger convergence value might help (-d, I believe).
>
> I'll try that.
>
>> Is there any chance your data is publicly shareable?  Come to think of it,
>> with the vector representations, as long as you don't publish the key (which
>> terms map to which index), I would think most all data is publicly
>> shareable.
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?

As in, post a copy of the SequenceFile somewhere for download, assuming
you can.  Then others could try it out.


>
>> Are you on trunk of Mahout?  I think we still need more profiling to get a
>> better idea of where improvements can be made.
>
> I am. Updated this morning.
>
> I still think this is a configuration issue, and have never considered
> Mahout's algorithm implementations to be the actual cause of the poor
> performance. For now, I've been running kMeans exclusively. Perhaps I
> should try different clustering methods and see if they take a
> similar amount of time to complete.

Well, KMeans normally runs two algorithms: canopy first, and then
KMeans itself.  You could try the random seed approach, which skips the
canopy run.
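
To make the two knobs mentioned in this thread concrete, here is a minimal
in-memory sketch in plain Java.  It is not Mahout's implementation or API:
the class RandomSeedKMeansSketch and its methods are hypothetical names, and
nothing here runs as a MapReduce job.  It assumes the convergence value acts
as a threshold on how far a centroid may move between iterations, and that
random seeding means picking k input points as the starting centroids
instead of deriving them from a canopy pass.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration only; not Mahout's KMeans driver or its API. */
final class RandomSeedKMeansSketch {

    /** Picks k distinct input points at random to serve as initial centroids. */
    static double[][] randomSeeds(double[][] points, int k, long seed) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be in 1.." + points.length);
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[indices.get(c)].clone();
        }
        return centroids;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Repeats assign/update steps until no centroid moves farther than delta
     * (assumed meaning of the convergence value) or maxIter is reached.
     * Updates centroids in place and returns the number of iterations run.
     */
    static int cluster(double[][] points, double[][] centroids, double delta, int maxIter) {
        int dim = points[0].length;
        for (int iter = 1; iter <= maxIter; iter++) {
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (double[] p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(p, centroids[c]) < squaredDistance(p, centroids[best])) {
                        best = c;
                    }
                }
                counts[best]++;
                for (int i = 0; i < dim; i++) {
                    sums[best][i] += p[i];
                }
            }
            double maxShift = 0.0;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(updated, centroids[c])));
                centroids[c] = updated;
            }
            if (maxShift <= delta) {
                return iter;
            }
        }
        return maxIter;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9} };
        double[][] tight = randomSeeds(points, 2, 42L);
        double[][] loose = randomSeeds(points, 2, 42L);
        System.out.println("iterations with delta=1e-9: " + cluster(points, tight, 1e-9, 100));
        System.out.println("iterations with delta=5.0:  " + cluster(points, loose, 5.0, 100));
    }
}
```

Starting from the same centroids, a larger delta can only stop at the same
iteration or an earlier one, which is why raising it may shorten a long run,
at the cost of centroids that have settled less.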
