mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: one vector or many vectors?
Date Thu, 01 Nov 2012 16:38:37 GMT
On Thu, Nov 1, 2012 at 9:30 AM, paritosh ranjan
<paritoshranjan5@gmail.com> wrote:

> "It is often helpful to classify small parts of large articles and then
> somehow deal with these multiple classifications at the full document
> level."
>
> The way I understand it is:
>
> For an article, classify the paragraphs (for example) and then use this
> first level classification result as features to classify the complete
> document.
> Am I correct?
>

Yes.


> If yes, then a training set would also be needed at both paragraph level
> and document level, which I think would not be that easy to get.
>

An easy way to generate a training set is to apply the document-level
label to every paragraph.  This is noisy, but it can work.  Better is to
assign training labels to the paragraphs themselves which, as you say, is
a lot of work.  Starting with document-level training and then manually
reviewing the problem paragraphs is a good way to speed up the collection
of training data.  Active learning is another excellent way to do that, as
is the simple heuristic of stratified sampling based on the score for each
category.
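As a minimal sketch of the first and last ideas above: the code below copies
each document's label onto its paragraphs, then picks paragraphs for manual
review by sampling from score bins for one category.  The function names, the
blank-line paragraph split, and the equal-width binning are assumptions made
for illustration; none of this is a Mahout API.

```python
import random
from collections import defaultdict


def propagate_labels(documents):
    """Give every paragraph the label of the document it came from.

    documents: list of (text, label) pairs.
    Paragraphs are assumed to be separated by blank lines.
    """
    examples = []
    for text, label in documents:
        for para in text.split("\n\n"):
            para = para.strip()
            if para:
                examples.append((para, label))
    return examples


def stratified_review_sample(scored, per_bin=2, n_bins=5, seed=0):
    """Pick paragraphs for manual review, spread across the score range.

    scored: list of (paragraph, score) pairs for ONE category, with
    scores in [0, 1].  Up to per_bin paragraphs are drawn at random
    from each equal-width score bin.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for para, score in scored:
        idx = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes in the top bin
        bins[idx].append((para, score))
    sample = []
    for idx in sorted(bins):
        items = bins[idx]
        rng.shuffle(items)
        sample.extend(items[:per_bin])
    return sample
```

The noisy paragraph labels from propagate_labels would train a first model;
its scores would then feed stratified_review_sample, once per category, to
choose which paragraphs a person looks at.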

Sometimes you can finesse the issue.


> I think, the question is more about the reason behind choosing small pieces
> of documents for training or to just train by a single document which is
> the aggregation of all the training documents for a particular class.
>

The reason is that big documents are often about many different topics.
Even moderate-sized documents may cover lots of different topics.  For
instance, it is common to have a paragraph like this in a scientific
article:

    The authors would like to acknowledge the generous support of Blah and
    Booh for this work. Additional support was provided by Foombah
    incorporated.


Is this paragraph about clustering, recommendations or early medieval
database transactions?  Taking it out of the classification problem for
the article itself will help make the classification of the article more
accurate.  For the record, I think that the topic/category for this
paragraph is "acknowledgement".
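One way to sketch this, assuming the paragraphs have already been given
labels by some first-level classifier: drop the labels that say nothing about
the article's topic and turn the rest into proportions that a document-level
classifier could use as features.  The function name and the ignore list are
hypothetical.

```python
from collections import Counter


def document_features(paragraph_labels, ignore=("acknowledgement",)):
    """Summarize paragraph-level labels as document-level features.

    Labels listed in `ignore` are left out, and the remaining labels are
    returned as fractions of the paragraphs that were kept.
    """
    counts = Counter(label for label in paragraph_labels if label not in ignore)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

For an article whose paragraphs were labeled clustering, clustering,
recommendations and acknowledgement, the features would describe only the
first three paragraphs.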

Another place where paragraph-by-paragraph classification is useful is in
discourse modeling or in certain kinds of named entity extraction tasks.
In these situations, having a classifier that tells you whether to run the
extractor on a paragraph is very handy.
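A rough sketch of that gating pattern follows.  The keyword test and the
regular expression are toy stand-ins for a trained paragraph classifier and a
real extractor, and every name here is hypothetical.

```python
import re


def run_gated_extraction(paragraphs, should_extract, extract):
    """Run `extract` only on the paragraphs that `should_extract` accepts."""
    results = []
    for para in paragraphs:
        if should_extract(para):
            results.extend(extract(para))
    return results


def looks_like_acknowledgement(para):
    # Toy gate: a real system would use a trained classifier here.
    text = para.lower()
    return "acknowledge" in text or "support" in text


def sponsor_names(para):
    # Toy extractor: capitalized words that follow "of", "and" or "by".
    return re.findall(r"\b(?:of|and|by)\s+([A-Z][a-z]+)", para)
```

The extractor never sees paragraphs the gate rejects, which is the point:
a cheap classifier keeps a noisy or expensive extractor away from text where
it would only produce junk.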


>
> Please correct me if I am wrong.
>

You are right.  It does make things harder.  It can also make them better.


>
> On Thu, Nov 1, 2012 at 9:39 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > Your mileage will vary.
> >
> > It is often helpful to classify small parts of large articles and then
> > somehow deal with these multiple classifications at the full document
> > level.
> >
> > Sometimes it is not helpful, especially if the small parts get too small.
> >
> > Try it both ways.  My tendency is to prefer to classify book-sized things
> > at a level smaller than a chapter and sometimes as small as a paragraph.
> >  Going below the paragraph level is usually bad.
> >
> > On Thu, Nov 1, 2012 at 3:23 AM, dennis zhuang <killme2008@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > >    I am using the SGD classifier to classify our articles. I want to
> > > train a new model, but there is a problem. I can give the learner one
> > > large article or several small articles, but I extract only one vector
> > > per article. Is there any difference for the learner between one
> > > vector and many vectors during training? Should I provide the learner
> > > one large article or many small articles? I can't find any
> > > documentation about this. Can anybody help me? Thanks.
> > >
> > > --
> > > 庄晓丹
> > > Email:        killme2008@gmail.com xzhuang@avos.com
> > > Site:           http://fnil.net
> > > Twitter:      @killme2008
> > >
> >
>
