mahout-user mailing list archives

From Daniel McEnnis <dmcen...@gmail.com>
Subject Re: Identify "less similar" documents
Date Wed, 13 Apr 2011 16:19:44 GMT
Claudia,

The term to look up is 'one-class classifier'.  It's built around this
problem, with a set of ready-made solutions.  I don't know if anyone has
put it in a general classifier before, but the theory is there.

Daniel.
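
To make the 'one-class classifier' idea concrete, here is a minimal sketch of one of the simplest schemes of that kind: average the training documents into a centroid, then accept a new document only if its cosine similarity to the centroid is at least as high as the lowest similarity seen in training. Everything in it is an assumption for illustration: the class name CentroidOneClass, the use of plain maps as term-frequency vectors, and the threshold rule. It is not a Mahout API and not necessarily the method Daniel has in mind.

```java
import java.util.*;

// Hypothetical one-class model: centroid of training vectors plus a similarity threshold.
class CentroidOneClass {
    private final Map<String, Double> centroid = new HashMap<>();
    private final double threshold;

    // Fit: average the length-normalized training vectors, then use the lowest
    // training similarity as the acceptance threshold (an assumed, simple rule).
    CentroidOneClass(List<Map<String, Double>> training) {
        for (Map<String, Double> doc : training) {
            double n = norm(doc);
            if (n == 0.0) continue; // skip empty documents
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                centroid.merge(e.getKey(), e.getValue() / n / training.size(), Double::sum);
            }
        }
        double min = 1.0;
        for (Map<String, Double> doc : training) {
            min = Math.min(min, similarity(doc));
        }
        threshold = min;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sumSq = 0.0;
        for (double x : v.values()) sumSq += x * x;
        return Math.sqrt(sumSq);
    }

    // Cosine similarity between a document and the centroid; 0 if either is empty.
    double similarity(Map<String, Double> doc) {
        double nd = norm(doc);
        double nc = norm(centroid);
        if (nd == 0.0 || nc == 0.0) return 0.0;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            dot += e.getValue() * centroid.getOrDefault(e.getKey(), 0.0);
        }
        return dot / (nd * nc);
    }

    // A document is accepted if it is at least as close as the farthest training document.
    boolean isInlier(Map<String, Double> doc) {
        return similarity(doc) >= threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("mahout", 1.0);
        a.put("vector", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("mahout", 2.0);
        b.put("vector", 1.0);
        CentroidOneClass model = new CentroidOneClass(Arrays.asList(a, b));

        Map<String, Double> unseen = new HashMap<>();
        unseen.put("pizza", 5.0);
        System.out.println("training doc accepted: " + model.isInlier(a));
        System.out.println("unseen doc accepted: " + model.isInlier(unseen));
    }
}
```

Taking the threshold from the training data sidesteps picking a cutoff by hand, but a single unusual training document can loosen it a lot; a quantile of the training similarities would be a more robust variant.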

On Wed, Apr 13, 2011 at 11:56 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
> Thanks for the help :)
>> Why not just train with those documents and put a category tag of "other" on
>> them and run normal categorization?  If you can distinguish these documents
>> by word frequencies, then this should do the trick.
> I don't know if this will help:
> 1) I'm still not sure where to put the threshold (if a document has a word
> frequency less than X... how do I choose X?)
> 2) The classifier is built incrementally: a document that would be classified
> as "other" today may be classified as "new category the user has just added"
> tomorrow. New docs in the training set and new categories are added from time
> to time.
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, 13 April 2011 17:34
> To: user@mahout.apache.org
> Cc: Claudia Grieco
> Subject: Re: Identify "less similar" documents
>
> I think that what you are doing is inventing an "other" category and
> building a classifier for that category.
>
> Why not just train with those documents and put a category tag of "other" on
> them and run normal categorization?  If you can distinguish these documents
> by word frequencies, then this should do the trick.
>
> On Wed, Apr 13, 2011 at 7:49 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
>
>> Let's see if this approach makes sense:
>> I have the documents to classify in a Lucene index (Index A) and the
>> training set in another Lucene index (Index B).
>> With a VectorMapper I map the term-frequency vectors of Index A to
>> term-frequency vectors of Index B. In this way the transformed vectors
>> contain only the frequencies of the terms in the training set.
>> By computing vector.zSum() I should get the total frequency of the
>> training-set terms in the document, right?
>> I compute vector.zSum() for all the docs to classify and exclude from the
>> classification the ones whose sum is less than 10% of the maximum
>> vector.zSum(): they mostly contain words never seen before and could be
>> classified wrongly.
>>
>> What do you think?
>>
>> -----Original Message-----
>> From: Claudia Grieco [mailto:grieco@crmpa.unisa.it]
>> Sent: Wednesday, 13 April 2011 11:12
>> To: user@mahout.apache.org
>> Subject: Identify "less similar" documents
>>
>> Hi guys,
>>
>> I'm using SGD to classify a set of documents but I have a problem: there
>> are some documents that are not related to any of the categories and I
>> want to be able to identify them and exclude them from the classification.
>> My idea is to read the documents of the training set (which are currently
>> in a Lucene index) and identify the docs that have fewer terms in common
>> with them. Any idea on how to do it?
>>
>> Thanks a lot
>>
>> Claudia
>>
>>
>>
>
>
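
The filter proposed earlier in the thread (keep only training-set terms in each document's term-frequency vector, sum what is left, and drop documents whose sum is below 10% of the maximum) can be sketched as follows. This is an illustration under assumptions: plain Java maps stand in for Mahout vectors and Lucene indexes, and the names VocabularyCoverageFilter, coverage and lowCoverageDocs are hypothetical. The coverage method plays the role that vector.zSum() plays after the mapping step described in the thread.

```java
import java.util.*;

// Hypothetical helper: flags documents that share few terms with the training vocabulary.
class VocabularyCoverageFilter {

    // Sum of a document's term frequencies, counting only terms in the training vocabulary.
    static double coverage(Map<String, Integer> termFreqs, Set<String> trainingVocabulary) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            if (trainingVocabulary.contains(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Ids of documents whose coverage is below fraction * (maximum coverage over all docs).
    static List<String> lowCoverageDocs(Map<String, Map<String, Integer>> docs,
                                        Set<String> trainingVocabulary,
                                        double fraction) {
        Map<String, Double> scores = new LinkedHashMap<>();
        double max = 0.0;
        for (Map.Entry<String, Map<String, Integer>> d : docs.entrySet()) {
            double c = coverage(d.getValue(), trainingVocabulary);
            scores.put(d.getKey(), c);
            max = Math.max(max, c);
        }
        List<String> excluded = new ArrayList<>();
        for (Map.Entry<String, Double> s : scores.entrySet()) {
            if (s.getValue() < fraction * max) {
                excluded.add(s.getKey());
            }
        }
        return excluded;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("mahout", "lucene", "vector"));

        Map<String, Integer> onTopic = new HashMap<>();
        onTopic.put("mahout", 12);
        onTopic.put("vector", 8);
        Map<String, Integer> offTopic = new HashMap<>();
        offTopic.put("pizza", 30);
        offTopic.put("lucene", 1);

        Map<String, Map<String, Integer>> docs = new LinkedHashMap<>();
        docs.put("doc-on-topic", onTopic);
        docs.put("doc-off-topic", offTopic);

        // 0.10 is the 10% cutoff proposed in the thread.
        System.out.println(lowCoverageDocs(docs, vocab, 0.10));
    }
}
```

Because the raw sum grows with document length, a long off-topic document can still pass and a short on-topic one can fail; dividing the coverage by the document's total term count would be one variation to consider.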
