mahout-user mailing list archives

From "Claudia Grieco"
Subject Re: Identify "less similar" documents
Date Tue, 19 Apr 2011 09:55:00 GMT
Thanks for the suggestion. I'm currently trying this hack:
- I take the documents of the training set and put all the docs of a given category into one cluster
- I compute the centroid of each category cluster
- I compute the distance from each new document to all centroids (I'm using CosineDistanceMeasure)
- I flag as "outliers" the documents whose distance is greater than X
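
The steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout code: it assumes documents are already sparse term-weight vectors stored as dicts, that cosine distance means 1 minus cosine similarity, and that a document counts as an outlier when even its nearest centroid is farther away than the threshold. The names `cosine_distance`, `category_centroids` and `is_outlier` are made up for this example.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def category_centroids(training):
    """training: iterable of (category, vector). Returns {category: mean vector}."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, vec in training:
        counts[category] += 1
        for term, w in vec.items():
            sums[category][term] += w
    return {
        cat: {term: total / counts[cat] for term, total in terms.items()}
        for cat, terms in sums.items()
    }


def is_outlier(vec, centroids, threshold):
    """True if the distance to the nearest centroid exceeds the threshold (X)."""
    nearest = min(cosine_distance(vec, c) for c in centroids.values())
    return nearest > threshold


# Toy training set: two categories, hand-made term weights.
training = [
    ("sports", {"ball": 1.0, "goal": 1.0}),
    ("sports", {"ball": 1.0, "team": 1.0}),
    ("finance", {"stock": 1.0, "bank": 1.0}),
]
centroids = category_centroids(training)
```

The threshold X still has to be chosen by experiment, for example by looking at the distribution of nearest-centroid distances over held-out documents whose category is known.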

Do you think this makes sense?

-----Original Message-----
From: Lance Norskog
Sent: Friday, 15 April 2011 4:27
Subject: Re: Identify "less similar" documents

Thinking of this in terms of clustering, outliers/misfits are
one-vector clusters, items that are far away from all others.
Clustering would be a slow system for finding these outliers, but an
interesting way to check them:

Cluster a sampled set of your items. Save the centroid and radius of
each cluster. To verify an outlier, look at its distance to all
centroids, and whether it is in the radius of the closest (few).
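
A rough sketch of that check, in plain Python rather than Mahout. It assumes the clustering has already been done elsewhere, that the "radius" of a cluster is the largest distance from its centroid to any member, and that Euclidean distance is the cross-check measure. The helper names `summarize_cluster` and `outside_nearest_clusters` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize_cluster(points):
    """Return (centroid, radius) for a non-empty list of equal-length vectors.

    Assumption: radius = distance from the centroid to the farthest member.
    """
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    radius = max(euclidean(p, centroid) for p in points)
    return centroid, radius


def outside_nearest_clusters(point, clusters, k=1):
    """True if the point lies outside the radius of each of its k closest clusters."""
    ranked = sorted(clusters, key=lambda c: euclidean(point, c[0]))
    return all(euclidean(point, centroid) > radius for centroid, radius in ranked[:k])


# Two toy clusters, as if produced by clustering a sampled set of items.
clusters = [
    summarize_cluster([[0, 0], [2, 0], [0, 2], [2, 2]]),
    summarize_cluster([[10, 10], [12, 10], [10, 12], [12, 12]]),
]
```

Raising `k` checks the closest few clusters instead of only the nearest one; only the saved centroids and radii are needed at verification time, not the original cluster members.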

Given a clustering algorithm you like, and a different distance method
than the categorization measure, this gives a good cross-check of the

On 4/14/11, Ted Dunning wrote:
> Hand classify all the documents that you can into the categories that you
> know.
> Classify the ones that don't fit into "other".
> On Thu, Apr 14, 2011 at 12:51 AM, Claudia Grieco wrote:
>> Thanks to everyone :)
>> So I should train the category "other" with some documents...but what
>> documents?
>> I should identify them first...that's a bit of a "chicken and egg" problem
>> Maybe I should do it this way:
>> - each day X new documents arrive to be classified
>> - I find 10-11 docs with low word frequencies relative to the training set (but
>> what is a "low" value?) and train on them as "other"
>> - I classify everything with the updated classifier
>> -----Original Message-----
>> From: Ted Dunning
>> Sent: Wednesday, 13 April 2011 19:29
>> Cc: Claudia Grieco
>> Subject: Re: Identify "less similar" documents
>> On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco wrote:
>> > Thanks for the help :)
>> > > Why not just train with those documents and put a category tag of "other" on
>> > > them and run normal categorization?  If you can distinguish these documents
>> > > by word frequencies, then this should do the trick.
>> > I don't know if this will help
>> >
>> Only an experiment will tell you.
>> > 1) I'm still not sure where to put the threshold (if a document has word
>> > frequency less than X, how do I choose X?)
>> >
>> The classifier should handle that for you for the most part.  Again,
>> experimentation is the way to go here.  My first cut would be to assign to
>> the category with the highest score, possibly including the other
>> category.
>> > 2) The classifier is built incrementally: a document that would be classified
>> > as "other" today may be classified as "new category the user has just added"
>> > tomorrow. New docs in the training set and new categories are added from
>> > time to time.
>> >
>> That is pretty easy.  Just retrain with the new category assignments.

Lance Norskog
