Thanks for the insights, Ted.
On 9 Mar 2013, at 18:40, Ted Dunning <ted.dunning@gmail.com> wrote:
> SVD techniques probably won't actually help that much given your current
> sparsity. There are two issues:
>
> First, your data is already quite small. SVD will only make it larger
> because the average number of nonzero elements will increase dramatically.
>
> Second, given your sparsity, SVD will have very little to work with. Very
> sparse data elements are inherently nearly orthogonal.
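[As an aside, a sketch not from the thread: the near-orthogonality point is easy to see empirically. With only a couple of nonzeros out of tens of millions of dimensions, two random sparse vectors almost never share an index, so their cosine similarity is almost always exactly 0. All names and numbers below are illustrative.]

```python
import random

def sparse_vec(dim, nnz):
    # Represent a sparse vector as {index: value}; index collisions may
    # leave fewer than nnz entries, which is fine for this demonstration.
    return {random.randrange(dim): 1.0 for _ in range(nnz)}

def cosine(a, b):
    # Cosine similarity between two {index: value} sparse vectors.
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

dim, nnz, trials = 30_000_000, 2, 10_000
sims = [cosine(sparse_vec(dim, nnz), sparse_vec(dim, nnz)) for _ in range(trials)]
print(sum(sims) / trials)  # effectively 0: random sparse vectors almost never overlap
```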
>
> I think you need to find more features so that your average number of
> nonzeros goes up.
>
> On Sat, Mar 9, 2013 at 12:53 PM, Colum Foley <columfoley@gmail.com> wrote:
>
>> Thanks a lot Ted. I think there's some preprocessing I can do to remove
>> some outliers, which may reduce my matrix size considerably. I'll also
>> check out some SVD techniques.
>> On 9 Mar 2013 17:16, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>>
>>> The new streaming kmeans should be able to handle that data pretty
>>> efficiently. My guess is that on a single 16-core machine it should be
>>> able to complete the clustering in 10 minutes or so. That is an
>>> extrapolation and thus could be wildly off, of course.
>>>
>>> You definitely mean sparse. 30 M / 20 M = 1.5 nonzero features per row.
>>> That may be a problem. Or it might make the clustering fairly trivial.
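[The figure above is just total nonzeros over rows; a quick check, reading "30 million" as the total nonzero count, which is an assumption behind the 30 M / 20 M:]

```python
rows = 20_000_000       # items
nnz_total = 30_000_000  # nonzero entries across the whole matrix (assumed reading)
avg_nnz_per_row = nnz_total / rows
print(avg_nnz_per_row)  # 1.5, matching the 30 M / 20 M above
```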
>>>
>>> Dan,
>>>
>>> That code isn't checked into trunk yet, I think. Can you comment on
>>> where working code can be found on github?
>>>
>>> On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <columfoley@gmail.com> wrote:
>>>
>>>> I have approximately 20 million items and a feature vector of approx 30
>>>> million in length, very sparse.
>>>>
>>>> Would you have any suggestions for other clustering algorithms I should
>>>> look at?
>>>>
>>>> Thanks,
>>>> Colum
>>>>
>>>> On 8 Mar 2013, at 22:51, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>
>>>>> You are beginning to exit the realm of reasonable applicability for
>>>>> normal kmeans algorithms here.
>>>>>
>>>>> How much data do you have?
>>>>>
>>>>> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <columfoley@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> When I run KMeans clustering on a cluster, I notice that when I have
>>>>>> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
>>>>>> errors:
>>>>>>
>>>>>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>>>>>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>>>>>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>>>>>>
>>>>>> This continues indefinitely, and lots of part-0xxxxx files are
>>>>>> produced with sizes of around 30 KB.
>>>>>>
>>>>>> If I reduce the value of k, it runs fine. Furthermore, if I run it in
>>>>>> local mode with high values of k, it runs fine.
>>>>>>
>>>>>> The command I am using is as follows:
>>>>>>
>>>>>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>>>>>> --clusters tmp -dm
>>>>>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>>>>>> -cd 1.0 -x 20 -cl -k 10000
>>>>>>
>>>>>> I am running mahout 0.7.
>>>>>>
>>>>>> Are there some performance parameters I need to tune for mahout when
>>>>>> dealing with large volumes of data?
>>>>>>
>>>>>> Thanks,
>>>>>> Colum
