mahout-user mailing list archives

From Dan Filimon <>
Subject Re: How to improve clustering?
Date Wed, 27 Mar 2013 05:43:31 GMT
Ah, so Ted, it looks like there's a bug with the mapreduce after all then.

A pity; I liked the higher-dimensionality argument, but thinking it through, it doesn't make
that much sense.

On Mar 27, 2013, at 6:52, Ted Dunning <> wrote:

> Reducing to a lower dimensional space is a convenience, no more.
> Clustering in the original space is fine.  I still have trouble with your
> normalizing before weighting, but I don't know what effect that will have
> on anything.  It certainly will interfere with the interpretation of the
> cosine metrics.
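Ted's concern about normalizing before weighting can be seen in a small numpy sketch (toy vector and made-up IDF-style weights, purely illustrative): once weights are applied to an already-normalized vector, the result is no longer unit length, so dot products between such vectors can no longer be read directly as cosines.

```python
import numpy as np

# A toy document vector and illustrative IDF-style weights
# (both made up for this sketch; not Mahout code).
x = np.array([1.0, 1.0, 0.0])
w = np.array([0.4, 0.4, 1.1])

# Normalize first, then weight -- the order under discussion.
v = (x / np.linalg.norm(x)) * w

# The weighted vector is no longer unit length, so a dot product
# between two such vectors is not the cosine of their angle.
print(np.linalg.norm(v))  # 0.4, not 1.0
```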
> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
>> wrote:
>> I am not quite sure whether this will solve the problem, though I will try
>> it of course.
>> I always thought that clustering documents based on their words is a
>> common problem and is usually tackled in the word space and not in a
>> reduced one.
>> Besides, the distances look reasonable. Still, I end up with very
>> similar and very distant documents unclustered in the middle of all the
>> clusters. So I think the problem lies in the clustering method, not in
>> the distances.
>> Dan Filimon <> wrote:
>>> So you're clustering 90K dimensional data?
>>> I'm facing a very similar problem to yours (working on
>>> StreamingKMeans for Mahout), and from what I read [1], the problem
>>> might be that in very high-dimensional spaces the distances become
>>> meaningless.
>>> I'm pretty sure this is the case and I was considering implementing
>>> the test mentioned in the paper (also I feel like it's a very useful
>>> algorithm to have).
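The distance-concentration effect Dan describes can be observed directly. This is a hedged numpy sketch (not the test from the cited paper): at high dimensionality, the smallest and largest pairwise distances between random points are nearly the same, so "nearest" carries little information.

```python
import numpy as np

rng = np.random.default_rng(1)

def min_max_ratio(n_points, dim):
    """Smallest over largest pairwise distance among uniform random points."""
    X = rng.random((n_points, dim))
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]
    return d.min() / d.max()

low = min_max_ratio(30, 2)       # low dimension: distances spread out
high = min_max_ratio(30, 5000)   # high dimension: distances bunch together
print(low, high)                 # high is much closer to 1 than low
```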
>>> In any case, since the vectors are so sparse, why not reduce their
>>> dimension?
>>> You can try principal component analysis (just taking the first k
>>> right singular vectors from the singular value decomposition of the
>>> matrix that has your vectors as rows). The class that does this is
>>> SSVDSolver (there's also SingularValueDecomposition, but that builds
>>> dense matrices, which might not fit into memory). I've never
>>> personally used it though.
>>> Once you have the first k singular vectors of size n, make them the
>>> rows of a matrix U and multiply each vector x you have by it (U x),
>>> getting a reduced vector.
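The projection described above can be sketched with plain numpy on a toy matrix (illustrative only; Mahout's SSVDSolver is what does this decomposition at scale): take the top-k right singular vectors of the document matrix and multiply each document vector by them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy document-term matrix: 6 documents (rows) over 10 terms.
A = rng.random((6, 10))

# Thin SVD: A = U diag(s) Vt. The rows of Vt are the right
# singular vectors of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
Uk = Vt[:k]             # (k, 10): top-k singular vectors as rows
reduced = A @ Uk.T      # (6, k): each document projected down to k dims

print(reduced.shape)    # (6, 3)
```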
>>> Or, use random projections to reduce the size of the data set. You
>>> want to create a matrix whose entries are sampled from a uniform
>>> distribution on (0, 1) (Functions.random in
>>> o.a.m.math.function.Functions), normalize its rows, and multiply each
>>> vector x by it.
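A minimal numpy sketch of that recipe (the uniform(0, 1) entries follow the description above; Gaussian entries are more common in the random-projection literature, and all sizes here are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, k = 1000, 50              # original and reduced dimensionality

# Random projection matrix: uniform(0, 1) entries, rows normalized
# to unit length, as described above.
R = rng.random((k, n_terms))
R /= np.linalg.norm(R, axis=1, keepdims=True)

x = rng.random(n_terms)            # a toy document vector
x_reduced = R @ x                  # k-dimensional approximation

print(x_reduced.shape)             # (50,)
```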
>>> So, reduce the size of your vectors, thereby making the dimensionality
>>> less of a problem, and you'll get a decent approximation (with SVD you
>>> can actually quantify how good it is). From what I've seen, the
>>> clusters separate at smaller dimensions, but there's the question of
>>> how good an approximation of the uncompressed data you have.
>>> See if this helps, I need to do the same thing :)
>>> What do you think?
>>> [1]
>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>> <> wrote:
>>>> The dataset consists of about 4000 documents encoded over a vocabulary
>>>> of about 90,000 words. However, each document usually contains only
>>>> about 10 to 20 words; only some contain more than 1000 words.
>>>> For each document, I set a field in the corresponding vector to 1 if
>>>> it contains a word. Then I normalize each vector using the L2-norm.
>>>> Finally, I multiply each element (representing a word) in the vector
>>>> by log(#documents/#documents_with_word).
>>>> For clustering, I am using cosine similarity.
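Sebastian's encoding, as described, can be sketched in numpy (toy documents, purely illustrative). Note that after the final weighting step the rows are no longer unit length, so cosine similarity has to divide by the norms explicitly:

```python
import numpy as np

docs = [{"apple", "banana"}, {"apple", "cherry", "date"}, {"banana"}]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Binary term-presence vectors, one row per document.
X = np.array([[1.0 if w in d else 0.0 for w in vocab] for d in docs])

# Step 1: L2-normalize each document vector.
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Step 2: scale each term by log(#documents / #documents_with_word).
df = np.array([sum(w in d for d in docs) for w in vocab])
X *= np.log(n_docs / df)

# Rows are no longer unit length after weighting, so cosine
# similarity needs the explicit normalization below.
cos = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(cos)
```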
>>>> Regards
>>>> Sebastian
>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>>>> Hi,
>>>>> Could you tell us more about the kind of data you're clustering?
>>>>> What distance measure are you using, and what is the dimensionality
>>>>> of the data?
>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>> <> wrote:
>>>>>> Dear Mahout-users,
>>>>>> I am facing two problems when I am clustering instances with Fuzzy
>>>>>> c-Means clustering (cosine distance, random initial clustering):
>>>>>> 1.) I always end up with one large set of rubbish instances. All of
>>>>>> them have a uniform cluster probability distribution and are, hence,
>>>>>> in the exact middle of the cluster space. The cosine distance
>>>>>> between instances within this cluster ranges from 0 to 1.
>>>>>> 2.) Some of my clusters have the same or a very similar center.
>>>>>> Besides the above-described problems, the clustering seems to work
>>>>>> fine. Does somebody have an idea how my clustering can be improved?
>>>>>> Regards
>>>>>> Sebastian
>> --
>> This message was sent from my Android mobile phone with K-9 Mail.
