mahout-user mailing list archives

From Colum Foley <columfo...@gmail.com>
Subject Re: KMeans Throwing Hadoop write errors for large values of K
Date Sat, 09 Mar 2013 20:56:53 GMT
thanks for the insights Ted

On 9 Mar 2013, at 18:40, Ted Dunning <ted.dunning@gmail.com> wrote:

> SVD techniques probably won't actually help that much given your current
> sparsity.  There are two issues:
> 
> first, your data is already quite small.  SVD will only make it larger
> because the average number of non-zero elements will increase dramatically.
> 
> second, given your sparsity, SVD will have very little to work with.  Very
> sparse data elements are inherently nearly orthogonal.
> 
> I think you need to find more features so that your average number of
> non-zeros goes up.
> 
> On Sat, Mar 9, 2013 at 12:53 PM, Colum Foley <columfoley@gmail.com> wrote:
> 
>> Thanks a lot Ted. I think there's some preprocessing I can do to remove
>> some outliers, which may reduce my matrix size considerably. I'll also
>> check out some SVD techniques.
>> On 9 Mar 2013 17:16, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>> 
>>> The new streaming k-means should be able to handle that data pretty
>>> efficiently.  My guess is that on a single 16 core machine it should be
>>> able to complete the clustering in 10 minutes or so.  That is an
>>> extrapolation and thus could be wildly off, of course.
>>> 
>>> You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
>>> That may be a problem.  Or it might make the clustering fairly trivial.
>>> 
>>> Dan,
>>> 
>>> That code isn't checked into trunk yet, I think.  Can you comment on
>>> where working code can be found on github?
>>> 
>>> On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <columfoley@gmail.com> wrote:
>>> 
>>>> I have approximately 20 million items and a feature vector approximately
>>>> 30 million in length, very sparse.
>>>> 
>>>> Would you have any suggestions for other clustering algorithms I should
>>>> look at ?
>>>> 
>>>> Thanks,
>>>> Colum
>>>> 
>>>> On 8 Mar 2013, at 22:51, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> 
>>>>> You are beginning to exit the realm of reasonable applicability for
>>>>> normal k-means algorithms here.
>>>>> 
>>>>> How much data do you have?
>>>>> 
>>>>> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <columfoley@gmail.com> wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> When I run KMeans clustering on a cluster, I notice that when I have
>>>>>> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
>>>>>> errors:
>>>>>> 
>>>>>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>>>>>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>>>>>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>>>>>> 
>>>>>> This continues indefinitely, and lots of part-0xxxxx files are
>>>>>> produced, each around 30 KB in size.
>>>>>> 
>>>>>> If I reduce the value of k it runs fine. Furthermore, if I run it in
>>>>>> local mode with high values of k it runs fine.
>>>>>> 
>>>>>> The command I am using is as follows:
>>>>>> 
>>>>>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>>>>>> --clusters tmp -dm
>>>>>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>>>>>> -cd 1.0 -x 20 -cl -k 10000
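For reference, the SquaredEuclideanDistanceMeasure named by -dm computes the sum of squared coordinate differences. Here is a minimal plain-Python sketch over dict-backed sparse vectors; it is not Mahout's Java implementation, only an illustration of the same math, and the example vectors are made up:

```python
# Hedged sketch of squared Euclidean distance on sparse vectors,
# represented as {index: value} dicts with absent indices treated as 0.
def squared_euclidean(a, b):
    """Sum of squared differences over the union of non-zero indices."""
    total = 0.0
    for idx in set(a) | set(b):
        diff = a.get(idx, 0.0) - b.get(idx, 0.0)
        total += diff * diff
    return total

u = {3: 1.0, 1_000_000: 2.0}    # hypothetical sparse row: index -> value
v = {3: 1.0, 42: 1.0}
print(squared_euclidean(u, v))  # 5.0 = (1-1)^2 + (2-0)^2 + (0-1)^2
```

Because only the union of non-zero indices is touched, the cost per pair scales with the non-zero counts rather than the 30-million-entry dimensionality, which is what makes distance computations on such sparse rows cheap.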
>>>>>> 
>>>>>> I am running mahout 0.7.
>>>>>> 
>>>>>> Are there some performance parameters I need to tune for mahout when
>>>>>> dealing with large volumes of data?
>>>>>> 
>>>>>> Thanks,
>>>>>> Colum
