Ah, so Ted, it looks like there's a bug with the mapreduce after all then.
Pity; I liked the higher-dimensionality argument, but thinking it
through, it doesn't make that much sense.
On Mar 27, 2013, at 6:52, Ted Dunning <ted.dunning@gmail.com> wrote:
> Reducing to a lower dimensional space is a convenience, no more.
>
> Clustering in the original space is fine. I still have trouble with your
> normalizing before weighting, but I don't know what effect that will have
> on anything. It certainly will interfere with the interpretation of the
> cosine metrics.
>
> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> sebastian.briesemeister@unister.de> wrote:
>
>> I am not quite sure whether this will solve the problem, though I will try
>> it of course.
>>
>> I always thought that clustering documents based on their words is a
>> common problem and is usually tackled in the word space and not in a
>> reduced one.
>> Besides, the distances look reasonable. Still, I end up with very
>> similar and very distant documents unclustered in the middle of all
>> clusters.
>>
>> So I think the problem lies in the clustering method, not in the
>> distances.
>>
>>
>>
>> Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>>
>>> So you're clustering 90K dimensional data?
>>>
>>> I'm faced with a problem very similar to yours (working on
>>> StreamingKMeans for Mahout), and from what I read [1], the problem
>>> might be that in very high-dimensional spaces the distances become
>>> meaningless.
>>>
>>> I'm pretty sure this is the case, and I was considering implementing
>>> the test mentioned in the paper (it also feels like a very useful
>>> algorithm to have).
>>>
>>> In any case, since the vectors are so sparse, why not reduce their
>>> dimension?
>>>
>>> You can try principal component analysis (just getting the top k
>>> right singular vectors from the singular value decomposition of the
>>> matrix that has your vectors as rows). The class that does this is
>>> SSVDSolver (there's also SingularValueDecomposition, but that builds
>>> dense matrices, and those might not fit into memory). I've never
>>> personally used it, though.
>>> Once you have the top k singular vectors of size n, make them the
>>> rows of a matrix U and multiply each of your vectors x by it (U x)
>>> to get a reduced vector.
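>>>
>>> To make that projection step concrete, here is a minimal sketch
>>> using Mahout's in-memory math classes (the k x n matrix of singular
>>> vectors is assumed to come from SSVDSolver or elsewhere; "basis" and
>>> "reduce" are just placeholder names):
>>>
>>> import org.apache.mahout.math.DenseVector;
>>> import org.apache.mahout.math.Matrix;
>>> import org.apache.mahout.math.Vector;
>>>
>>> // basis: k x n matrix whose rows are the top-k singular vectors
>>> // x:     one of the original n-dimensional (sparse) document vectors
>>> Vector reduce(Matrix basis, Vector x) {
>>>   int k = basis.numRows();
>>>   Vector reduced = new DenseVector(k);
>>>   for (int i = 0; i < k; i++) {
>>>     // each reduced coordinate is the projection onto one row of basis
>>>     reduced.set(i, basis.viewRow(i).dot(x));
>>>   }
>>>   return reduced;
>>> }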
>>>
>>> Or, use random projections to reduce the size of the data set. You
>>> want to create a matrix whose entries are sampled from a uniform
>>> (0, 1) distribution (Functions.random in
>>> o.a.m.math.function.Functions), normalize its rows, and multiply
>>> each vector x by it.
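>>>
>>> A rough sketch of that, too (k = 200 is an arbitrary choice here,
>>> not a recommendation; pick whatever target dimension works for you):
>>>
>>> import org.apache.mahout.math.DenseMatrix;
>>> import org.apache.mahout.math.Matrix;
>>> import org.apache.mahout.math.function.Functions;
>>>
>>> // n: original dimensionality (~90,000 here), k: target dimensionality
>>> Matrix randomProjection(int n, int k) {
>>>   Matrix projection = new DenseMatrix(k, n);
>>>   projection.assign(Functions.random());  // uniform (0, 1) entries
>>>   for (int i = 0; i < k; i++) {
>>>     // normalize each row of the projection matrix
>>>     projection.assignRow(i, projection.viewRow(i).normalize());
>>>   }
>>>   return projection;
>>> }
>>>
>>> // then reduce each document vector x as before, e.g.
>>> // Vector reduced = reduce(randomProjection(90000, 200), x);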
>>>
>>> So, reduce the size of your vectors, thereby making the
>>> dimensionality less of a problem, and you'll get a decent
>>> approximation (you can actually quantify how good it is with SVD).
>>> From what I've seen, the clusters separate at smaller dimensions,
>>> but there's the question of how good an approximation of the
>>> uncompressed data you have.
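>>>
>>> (To be concrete about "quantify": if the singular values are
>>> s_1 >= s_2 >= ..., keeping only the top k leaves a squared
>>> reconstruction error of s_{k+1}^2 + s_{k+2}^2 + ..., so the fraction
>>> of variance you keep is
>>> (s_1^2 + ... + s_k^2) / (s_1^2 + ... + s_n^2).)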
>>>
>>> See if this helps; I need to do the same thing :)
>>>
>>> What do you think?
>>>
>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>>
>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>> <sebastian.briesemeister@unistergmbh.de> wrote:
>>>> The dataset consists of about 4000 documents and is encoded by
>>>> 90,000 words. However, each document usually contains only about
>>>> 10 to 20 words. Only some contain more than 1000 words.
>>>>
>>>> For each document, I set a field in the corresponding vector to 1
>>>> if it contains a word. Then I normalize each vector using the L2
>>>> norm.
>>>> Finally, I multiply each element (representing a word) in the
>>>> vector by log(#documents/#documents_with_word).
>>>>
>>>> For clustering, I am using cosine similarity.
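>>>>
>>>> Per document, it is roughly this (just a sketch of the above;
>>>> "idf" is assumed to hold log(#documents/#documents_with_word) per
>>>> word id, and "encode" is only a placeholder name):
>>>>
>>>> import org.apache.mahout.math.RandomAccessSparseVector;
>>>> import org.apache.mahout.math.Vector;
>>>>
>>>> Vector encode(int numWords, int[] wordIds, double[] idf) {
>>>>   Vector v = new RandomAccessSparseVector(numWords);  // ~90,000 dims
>>>>   for (int w : wordIds) {
>>>>     v.set(w, 1.0);                    // binary word occurrence
>>>>   }
>>>>   Vector normalized = v.normalize();  // L2 normalization first
>>>>   for (int w : wordIds) {
>>>>     // ...then the idf weighting on the already-normalized values
>>>>     normalized.set(w, normalized.get(w) * idf[w]);
>>>>   }
>>>>   return normalized;
>>>> }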
>>>>
>>>> Regards
>>>> Sebastian
>>>>
>>>> On 26.03.2013 17:33, Dan Filimon wrote:
>>>>> Hi,
>>>>>
>>>>> Could you tell us more about the kind of data you're clustering?
>>>>> What distance measure are you using, and what is the
>>>>> dimensionality of the data?
>>>>>
>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>> <sebastian.briesemeister@unistergmbh.de> wrote:
>>>>>> Dear Mahout users,
>>>>>>
>>>>>> I am facing two problems when clustering instances with Fuzzy
>>>>>> c-Means clustering (cosine distance, random initial clustering):
>>>>>>
>>>>>> 1.) I always end up with one large set of rubbish instances. All
>>>>>> of them have a uniform cluster probability distribution and are,
>>>>>> hence, in the exact middle of the cluster space.
>>>>>> The cosine distance between instances within this cluster ranges
>>>>>> from 0 to 1.
>>>>>>
>>>>>> 2.) Some of my clusters have the same or a very similar center.
>>>>>>
>>>>>> Apart from the problems described above, the clustering seems to
>>>>>> work fine.
>>>>>>
>>>>>> Does anybody have an idea how my clustering can be improved?
>>>>>>
>>>>>> Regards
>>>>>> Sebastian
>>
>> 
>> This message was sent from my Android mobile phone with K9 Mail.
