# mahout-user mailing list archives

From Yash Sharma <yash...@gmail.com>
Subject Re: Question Regarding Entropy calculation in Mahout
Date Fri, 23 May 2014 18:56:26 GMT
Well, I was not aware of the perplexity calculation. Your point makes perfect
sense.
Entropies calculated independently for each cluster would not serve any
purpose.

So the question moves back to the questioner and I'd move back to textbooks
:)

Peace,
Yash

On Sat, May 24, 2014 at 12:01 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Yash,
>
> I am not sure how your suggestion will work.
>
> The problem is clustering algorithms tend to make hard assignments.  Thus,
> if you try to compute entropy relative to some reference probability
> distribution (aka perplexity [1]) then a reference clustering will provide
> 1 or 0 as the probability.  Any item that gets classified into a different
> cluster will cause the Entropy to include a term - 1 log 0 which is
> infinite.
>
> One way to deal with this is to assign probability 1-\epsilon to the
> cluster an item is in and \epsilon/(k-1) for all the other clusters.  You
> then have issues finding a good value of \epsilon which seem to me to be
> out of scope for the original question.
>
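A minimal sketch of the smoothing idea above, in Python. It assumes the two clusterings use matching cluster ids (not something the thread establishes), and the function name, the default epsilon, and the choice of log base 2 are illustrative only. Choosing a good epsilon remains the open issue mentioned above.

```python
import math

def smoothed_cross_entropy(reference, observed, k, epsilon=0.01):
    """Average -log2 q(observed cluster) per item, where q puts
    1 - epsilon on the item's reference cluster and epsilon / (k - 1)
    on each of the other k - 1 clusters."""
    if k < 2 or not 0.0 < epsilon < 1.0:
        raise ValueError("need k >= 2 and 0 < epsilon < 1")
    if not reference or len(reference) != len(observed):
        raise ValueError("need two non-empty assignments of equal length")
    total = 0.0
    for ref, obs in zip(reference, observed):
        # Smoothing keeps q above zero, so the log never becomes infinite.
        q = 1.0 - epsilon if ref == obs else epsilon / (k - 1)
        total -= math.log2(q)
    return total / len(reference)

# Hypothetical assignments: one item per position, cluster ids 0..k-1.
print(smoothed_cross_entropy([0, 0, 1, 2], [0, 0, 1, 1], k=3))
```

Without the smoothing, the last item in this toy example would contribute the infinite term described above.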
> Computing entropy relative to the fraction of documents in each cluster is
> easier to compute, but much harder to understand.  Computing mutual
> information (not entropy) on the confusion matrix between two clusterings
> can also be done, but that also seems beyond the original question.
>
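The two quantities in the paragraph above can be sketched in Python as follows. This is only an illustration under stated assumptions: each clustering is given as a list of cluster ids, one per document, and the function names are hypothetical. Neither function calls Mahout.

```python
import math
from collections import Counter

def size_entropy(assignment):
    """Entropy (in bits) of the fraction of documents in each cluster."""
    n = len(assignment)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(assignment).values())

def mutual_information(a, b):
    """Mutual information (in bits) computed from the confusion matrix
    between two clusterings a and b of the same documents."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty assignments of equal length")
    n = len(a)
    joint = Counter(zip(a, b))   # confusion-matrix cell counts
    count_a = Counter(a)         # row totals
    count_b = Counter(b)         # column totals
    return sum((c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

# Hypothetical clusterings of four documents.
first = [0, 0, 1, 1]
second = [1, 1, 0, 0]
print(size_entropy(first), mutual_information(first, second))
```

Mutual information does not depend on how the clusters are numbered, so no matching of cluster ids between the two clusterings is needed.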
> As such, I think that the burden is on the original questioner to describe
> the problem more accurately.
>
>
>
> On Fri, May 23, 2014 at 11:21 AM, Yash Sharma <yash360@gmail.com> wrote:
>
> > Hi Darshan,
> > What I understand from your problem is that:
> > - You have clustered a few documents.
> > - You want to verify the accuracy of your clustering, and you want to use
> > entropy for that.
> > - You are not sure what the input for the entropy calculation should be.
> >
> > Possible solution:
> > The entropy calculation would expect a String[] to calculate the
> > information contained in the data/sequence.
> > The simplest way is to keep all the documents labelled with categories.
> > - Cluster the docs as you usually do.
> > - For the entropy calculation, create a String[] for every cluster, each
> > array containing the labels of all the docs in the cluster.
> > cluster1 = {"sports", "tech", "tech", "tech", "book", ..}
> > cluster2 = {"sports", "drama", "sports", "sports"...}
> > etc
> >
> > - Calculate the entropy of each cluster.
> > Entropy measures the degree of randomness of a system. High entropy
> > means there is a high degree of randomness in the system.
> > Lower entropy is desirable when validating the accuracy of your clustering
> > technique.
> >
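The per-cluster calculation described above can be sketched in plain Python. This does not use Mahout's Entropy.java, whose input format is not shown in this thread. The label lists are the visible part of the truncated examples above, and the function names, the size-weighted average, and the purity measure are assumptions added for illustration. Clusters are assumed to be non-empty.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the category labels inside one cluster."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(clusters):
    """Per-cluster entropies averaged with cluster sizes as weights."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * label_entropy(c) for c in clusters)

def purity(clusters):
    """Fraction of docs that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

cluster1 = ["sports", "tech", "tech", "tech", "book"]
cluster2 = ["sports", "drama", "sports", "sports"]

for name, labels in (("cluster1", cluster1), ("cluster2", cluster2)):
    print(name, round(label_entropy(labels), 4))
print("weighted entropy:", round(weighted_entropy([cluster1, cluster2]), 4))
print("purity:", round(purity([cluster1, cluster2]), 4))
```

A cluster whose docs all share one label gets entropy 0, and a more mixed cluster gets a higher value.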
> > P.S. You can use the Entropy.java class for your validation purposes, but
> > it's deprecated now.
> >
> > Having said that, kindly be patient when asking questions and provide
> > more info on the work you have done so far and your findings. It would
> > enable all of us to answer quickly & correctly :)
> >
> > Hope it was helpful. Other approaches are welcome!
> >
> > Peace,
> > Yash
> >
> >
> > On Fri, May 23, 2014 at 10:55 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > I am sorry, but I don't understand your questions or needs
> > > sufficiently to
> > >
> > >
> > >
> > >
> > > On Wed, Apr 23, 2014 at 12:21 PM, Darshan Sonagara <
> > > darshan.sonagara@gmail.com> wrote:
> > >
> > > > > Sir, please reply to me as soon as possible.
> > > > > Thanks in advance.
> > > >
> > > >
> > > > On Tue, Apr 22, 2014 at 11:50 PM, Darshan Sonagara <
> > > > darshan.sonagara@gmail.com> wrote:
> > > >
> > > > > > Waiting for the reply, sir.
> > > > >
> > > > >
> > > > > On Tue, Apr 22, 2014 at 7:13 PM, Darshan Sonagara <
> > > > > darshan.sonagara@gmail.com> wrote:
> > > > >
> > > > > >> Thanks for the reply, sir.
> > > > > >>
> > > > > >> Actually, I am doing clustering to gather similar kinds of documents
> > > > > >> in the same cluster as much as possible.
> > > > > >> I can see this from the cluster dump output file by observing the
> > > > > >> top terms.
> > > > > >> I also figured out that the result differs when I vary the distance
> > > > > >> measure technique.
> > > > > >> But I want some mathematical proof that one technique is better than
> > > > > >> another, so I need to calculate the entropy and purity of the clusters.
> > > > > >> I am not able to find any command in Mahout that gives me entropy
> > > > > >> as a result.
> > > > > >> I found Entropy.java under the Mahout common math statistic package,
> > > > > >> but I don't know what I should give it as input to find the entropy
> > > > > >> or other parameters, so that I can tell how good or bad a cluster is.
> > > > >>
> > > > >>
> > > > >>
> > > > > >> On Tue, Apr 22, 2014 at 7:01 PM, Ted Dunning <ted.dunning@gmail.com>
> > > > > >> wrote:
> > > > >>
> > > > >>> On Tue, Apr 22, 2014 at 12:11 AM, Darshan Sonagara <
> > > > >>> darshan.sonagara@gmail.com> wrote:
> > > > >>>
> > > > > >>> > But the problem is that I want to check whether my clustering is
> > > > > >>> > good or bad, so I need to calculate the entropy value. I have no
> > > > > >>> > idea how to calculate entropy in Mahout or by any other technique.
> > > > > >>> > By finding the entropy I can reach a good conclusion.
> > > > > >>> > So please, can anyone help me with this?
> > > > >>> >
> > > > >>>
> > > > > >>> Actually, the way to tell whether your clustering is good is to see
> > > > > >>> if it works for its intended use.
> > > > >>>
> > > > >>> What do you want to use clustering for?
> > > > >>>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >>
> > > > >> *Regards From:*
> > > > >>
> > > > >> *Darshan  Sonagara*
> > > > >> *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> > > > >>
> > > > >> *Vice-Chairperson | **GCET IEEE SB.*
> > > > >>
> > > > >> (: +*91* 9408002452
> > > > >>
> > > > >>
> > > > >>
> > > > >>   : Darshan Sonagara <http://www.facebook.com/darshansonagara>
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Regards From:*
> > > > >
> > > > > *Darshan  Sonagara*
> > > > > *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> > > > >
> > > > > *Vice-Chairperson | **GCET IEEE SB.*
> > > > >
> > > > > (: +*91* 9408002452
> > > > >
> > > > >
> > > > >
> > > > >  : Darshan Sonagara<
> > > > >   : Darshan Sonagara <http://www.facebook.com/darshansonagara>
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Regards From:*
> > > >
> > > > *Darshan  Sonagara*
> > > > *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> > > >
> > > > *Vice-Chairperson | **GCET IEEE SB.*
> > > >
> > > > (: +*91* 9408002452
> > > >
> > > >
> > > >
> > > >  : Darshan Sonagara<