Also, the knn package has a single pass kmeans implementation that
can easily handle 20,000 clusters or more. This is done by using an
approximate nearest neighbor algorithm inside the kmeans
implementation to decrease the time dependency on k to roughly log k.
See http://github.com/tdunning/knn
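For intuition only: one classic way an approximate nearest-neighbor search can cut the per-point assignment cost from O(k) to roughly O(log k) is projection search. The toy sketch below illustrates that general idea (sort centroids by their projection onto a random direction, binary-search the query's projection, scan a small window); it is not the knn package's actual code, and the class and parameter names are made up for this example.

```java
import java.util.Arrays;
import java.util.Random;

public class ProjectionSearchDemo {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Find the (approximately) nearest of k centroids without scanning all k:
    // binary search over the sorted projections, then a short linear scan.
    static int demo() {
        int k = 10_000, dim = 20, window = 64;
        Random rnd = new Random(42);
        double[][] centroids = new double[k][dim];
        for (double[] c : centroids)
            for (int d = 0; d < dim; d++) c[d] = rnd.nextGaussian();

        double[] u = new double[dim];                 // random projection direction
        for (int d = 0; d < dim; d++) u[d] = rnd.nextGaussian();

        // Sort centroid indices by their projection value.
        Integer[] order = new Integer[k];
        double[] proj = new double[k];
        for (int i = 0; i < k; i++) { proj[i] = dot(centroids[i], u); order[i] = i; }
        Arrays.sort(order, (a, b) -> Double.compare(proj[a], proj[b]));
        double[] sorted = new double[k];
        for (int i = 0; i < k; i++) sorted[i] = proj[order[i]];

        // Query with a known centroid, so the true nearest neighbor is itself.
        double[] q = centroids[1234];
        int pos = Arrays.binarySearch(sorted, dot(q, u));
        if (pos < 0) pos = -pos - 1;                  // insertion point if not found

        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = Math.max(0, pos - window); i < Math.min(k, pos + window); i++) {
            double dist = 0;
            for (int d = 0; d < dim; d++) {
                double diff = q[d] - centroids[order[i]][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = order[i]; }
        }
        return best;  // the exact match falls inside the scanned window
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1234
    }
}
```

The window scan bounds the error: a larger window trades speed for accuracy, and real implementations use several random directions rather than one.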
Any help in testing these new capabilities or plumbing them into the
standard Mahout capabilities would be very much appreciated.
On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The upcoming knn package has a file-based matrix implementation that uses
> memory mapping to allow sharing a copy of a large matrix between processes
> and threads.
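For readers unfamiliar with the mechanism: a memory-mapped file lets every process that maps it share one page-cached copy instead of each holding its own heap copy. The sketch below is a minimal, self-contained illustration using plain `java.nio`; it is not the knn package's API, and the class name is invented for this example.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapMatrixDemo {
    // Store a rows x cols matrix of doubles in a file-backed mapping,
    // write one entry, and read it back through the mapped buffer.
    static double demo() throws Exception {
        File f = File.createTempFile("matrix", ".bin");
        f.deleteOnExit();
        int rows = 100, cols = 50;
        long bytes = (long) rows * cols * Double.BYTES;
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            DoubleBuffer m = buf.asDoubleBuffer();
            m.put(3 * cols + 7, 42.0);   // matrix[3][7] = 42.0
            return m.get(3 * cols + 7);  // any process mapping this file sees it
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // 42.0
    }
}
```

Because the data lives in the OS page cache rather than the Java heap, a matrix far larger than `-Xmx` can be read this way.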
>
> Sent from my iPhone
>
> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <p.abramov@ramblerco.ru> wrote:
>
>> Hello,
>>
>> I think Zipf's law is relevant for my data. Thanks for the idea about
>> memory mapping.
>>
>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>> clusters with only 1 member while 1 large cluster has 50% members. Is
>> there a way to "split" large clusters with Kmeans?
>>
>> 2) Can I force Mahout not to use the exact centroid but the closest point
>> from my set? Every point has ~10 nonzero components while the exact
>> centroid is very dense (~200k).
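As far as I know Mahout's k-means has no built-in option for this, but it can be done as a post-processing pass: for each cluster, replace the dense centroid with the member nearest to it (a medoid-style representative, which stays as sparse as the data). A minimal sketch with dense arrays for clarity; the class and method names are hypothetical:

```java
public class CentroidSnap {
    // Return the index of the cluster member closest (squared Euclidean
    // distance) to the computed centroid; ties go to the first member.
    static int nearestMemberIndex(double[] centroid, double[][] members) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < members.length; i++) {
            double dist = 0;
            for (int d = 0; d < centroid.length; d++) {
                double diff = centroid[d] - members[i][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] members = { {0, 0}, {2, 0}, {10, 10} };
        double[] centroid = {1, 0};  // mean of the first two members
        System.out.println(nearestMemberIndex(centroid, members)); // 0
    }
}
```

With sparse Mahout vectors the distance loop would iterate only over nonzero elements, so the pass is cheap relative to a k-means iteration.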
>>
>>
>> Thanks!
>>
>> Pavel
>>
>>
>> On 09.08.12 5:43, "Lance Norskog" <goksron@gmail.com> wrote:
>>
>>> If Zipf's Law is relevant, locality will be much better than random.
>>> Maybe you need a Vector implementation that is backed by memorymapped
>>> files?
>>>
>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <p.abramov@ramblerco.ru>
>>> wrote:
>>>> Thank you Jeff, Paritosh,
>>>>
>>>> Reducing the cluster count from 1000 to 100 made my day. I will also try
>>>> Canopy for the initial cluster count.
>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>> no infrequent terms.
>>>>
>>>> I will provide you with the Hadoop config shortly, but I am pretty sure
>>>> I can't decrease the number of mappers/reducers per node or increase the
>>>> memory limits. It would affect the whole cluster.
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Pavel
>>>>
>>>>
>>>> On 08.08.12 16:15, "Jeff Eastman" <jdog@windwardsolutions.com> wrote:
>>>>
>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>> space to be consumed rapidly, as you have noted. Sometimes you can work
>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>> have completed.
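The back-of-envelope arithmetic behind that warning is worth spelling out. Assuming the retained vectors densify completely (the worst case), the per-task memory is k clusters x 4 vectors x 200k terms x 8 bytes per double:

```java
public class HeapEstimate {
    // Worst-case heap for cluster state in one task, assuming every retained
    // vector becomes fully dense over the iterations.
    static long worstCaseBytes(long k, long vectorsPerCluster, long terms) {
        return k * vectorsPerCluster * terms * Double.BYTES;
    }

    public static void main(String[] args) {
        long bytes = worstCaseBytes(1000, 4, 200_000);
        System.out.printf("%.1f GB per task in the worst case%n", bytes / 1e9);
        // 6.4 GB per task in the worst case
    }
}
```

6.4 GB per task dwarfs a typical mapper heap, which is why the second iteration (after densification starts) is the one that dies; cutting k to 100 shrinks the bound tenfold.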
>>>>>
>>>>> In order to provide more help to you, it would be useful to understand
>>>>> more about how your cluster is "well tuned": how many mappers and
>>>>> reducers you are launching in parallel, the heap space limits set for
>>>>> tasks on each node, etc.
>>>>>
>>>>> For a quick test, try reducing k to 500 or 100 to see how many
>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>> can be hard to arrive at good T values in some text clustering cases.
>>>>> You can also try hierarchical clustering with reduced k to stay under
>>>>> your memory limits.
>>>>>
>>>>>
>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>> A stack trace of the error would have helped in finding the exact error.
>>>>>>
>>>>>> However, the number of clusters can create heap space problems (if the
>>>>>> vector dimension is also high).
>>>>>> Either try to reduce the number of initial clusters (in my opinion, the
>>>>>> best way to know about initial clusters is Canopy Clustering:
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>>>>
>>>>>> or try to reduce the dimension of the vectors.
>>>>>>
>>>>>> PS: you are also providing --numClusters twice:
>>>>>>
>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>
>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am trying to run the KMeans example on 15 000 000 documents
>>>>>>> (seq2sparse output).
>>>>>>> There are 1 000 clusters, a 200 000-term dictionary and a document
>>>>>>> size of 3-10 terms (titles). seq2sparse produces 200 files of 80 MB
>>>>>>> each.
>>>>>>>
>>>>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>>>>> while the 2nd iteration fails. In the Map phase of buildClusters I see
>>>>>>> a lot of warnings, but it passes. The Reduce phase of buildClusters
>>>>>>> fails with "Java heap space".
>>>>>>>
>>>>>>> I can not increase reducer/mapper memory in Hadoop. My cluster is
>>>>>>> tuned well.
>>>>>>>
>>>>>>> How can I avoid this situation? My cluster has 300 mappers and 220
>>>>>>> reducers running on 40 servers with 8 cores and 12 GB RAM each.
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Here are the KMeans parameters:
>>>>>>>
>>>>>>>
>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>> -i ...tfidf-vectors/ \
>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>> --clusters /tmp/clusters/ \
>>>>>>> --numClusters 1000 \
>>>>>>> --numClusters 5 \
>>>>>>> --overwrite \
>>>>>>> --clustering
>>>>>>>
>>>>>>>
>>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>