mahout-user mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Re: n-gram and ml
Date Sat, 09 Jun 2012 18:39:10 GMT
------
Robin Anil


On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> As I understand it, when using seq2sparse with ng = 2 and ml = some large
> number, this will never create a vector with fewer terms than words (all
> other parts of the algorithm set aside). In other words, ng = 2 and ml =
> 2000 will create very few n-grams but will never create a zero-length
> vector unless there are no terms to begin with.
>
> Is this correct?
>
> I ask because it looks like many of my n-grams are not really helpful, so
> I keep tuning ml upwards, but Robin made a comment that this might cause
> zero-length vectors, in which case I might want to stop using n-grams.
>

You didn't quite get me. I meant ml = the minimum log-likelihood ratio
(LLR) threshold. A bigram with a log-likelihood score of 1.0 is already
quite a significant n-gram; if you set ml > 2000, there might not be any
n-gram with such a score. Secondly, df pruning at 40%, combined with an
ml > 200 threshold, is creating vectors in your dataset that are devoid of
features, i.e. empty vectors.
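
For concreteness, here is a minimal sketch of a seq2sparse run with
moderate pruning. The paths are placeholders and the threshold values are
purely illustrative, not values taken from your dataset:

  # -ng 2  -> maxNGramSize: generate unigrams and bigrams
  # -ml 50 -> minLLR: keep only n-grams with an LLR score >= 50
  # -x 40  -> maxDFPercent: prune terms appearing in more than 40% of docs
  bin/mahout seq2sparse \
    -i /path/to/seqfiles -o /path/to/vectors \
    -ng 2 -ml 50 -x 40 -wt tfidf -nv -ow

Pushing -ml into the hundreds or thousands on top of aggressive df pruning
is exactly what can leave a document with no surviving features, i.e. the
empty vectors mentioned above.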
