mahout-user mailing list archives

From Fernando Santos <fernandoleandro1...@gmail.com>
Subject Re: SVM Implementation for mahout?
Date Mon, 09 Dec 2013 01:50:08 GMT
Hello,

I think my problem is related to the fact that the dataset is really
unbalanced. My three classes have roughly 550k, 150k and 70k documents.
Naive Bayes also bases its classification on the prior probability of a
class c over all documents, so this imbalance is probably making a big
difference.
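
A rough illustration of why the class sizes matter, assuming a textbook
multinomial naive Bayes that adds log P(c) to each class score (Mahout's
trainer may weight things differently, so treat this only as a sketch). The
class labels are placeholders; only the counts come from the numbers above.

```python
import math

# Document counts per class as quoted above; the labels are hypothetical.
class_counts = {"class_a": 550_000, "class_b": 150_000, "class_c": 70_000}


def log_priors(counts):
    """Return log P(c) for each class, estimated from raw document counts."""
    total = sum(counts.values())
    return {label: math.log(n / total) for label, n in counts.items()}


priors = log_priors(class_counts)

# Difference in log prior between the largest and the smallest class.
# A document's word evidence has to outweigh this gap before the
# smallest class can win under this simplified model.
head_start = priors["class_a"] - priors["class_c"]

for label, value in priors.items():
    print(label, round(math.exp(value), 3), round(value, 3))
print("head start (log units):", round(head_start, 3))
```

Under this assumption the largest class starts every comparison with a head
start of log(550/70) in log-score over the smallest one, before any word has
been looked at.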

Lucas, I'm only using the pre-processing available through seq2sparse:
a minimum word frequency and a maximum document frequency percentage
(which works as a stoplist). And yes, I'm using the tf-idf vectors for
training and testing.
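
To make the two filters concrete, here is a minimal sketch of the idea in
plain Python. It is not Mahout code: the function name and the parameters
`min_support` and `max_df_percent` are made up for illustration, and the toy
documents are invented.

```python
from collections import Counter


def build_vocabulary(docs, min_support=2, max_df_percent=90):
    """Keep terms that are frequent enough overall but not in almost every document.

    docs: list of token lists.
    min_support: minimum total occurrences of a term in the corpus.
    max_df_percent: maximum share of documents (0-100) a term may appear in.
    """
    term_freq = Counter()  # total occurrences per term
    doc_freq = Counter()   # number of documents containing the term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    return sorted(
        term
        for term in term_freq
        if term_freq[term] >= min_support
        and 100.0 * doc_freq[term] / n_docs <= max_df_percent
    )


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
vocab = build_vocabulary(docs, min_support=2, max_df_percent=90)
print(vocab)
```

In this toy corpus "the" appears in every document, so the document
frequency cap drops it just like a stoplist would, while words that occur
only once fall below the minimum frequency.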

Actually, I had never heard of PCA or LDA. I'll take a look at them.

Thanks


2013/12/8 Lucas Fernandes Brunialti <lbrunialti@igcorp.com.br>

> Hi,
>
> Fernando, to get a better understanding of correlation, you could think of
> features as events in probability, then if the probability of the
> intersection is high, the events are highly correlated...
>
> I agree with Ted. But usually, naive Bayes works well with text
> classification when you have a good pre-processing phase, using PCA, tf-idf
> or LDA... Are you doing any pre-processing?
> On Dec 8, 2013 3:25 PM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>
> >
> > The problem of correlation of features is clearly present in text, but it
> > is not so clear what the effect will be. For naive bayes this has the
> > effect of making the classifier overconfident, but it usually still works
> > reasonably well. For logistic regression without regularization it can
> > cause the learning algorithm to fail (Mahout's logistic regression is
> > regularized, btw).
> >
> > Empirical evidence dominates theory in this situation.
> >
> > Sent from my iPhone
> >
> > > On Dec 8, 2013, at 9:14, Fernando Santos <fernandoleandro1991@gmail.com> wrote:
> > > >
> > > > Now just a theoretical doubt. In a text classification example, what
> > > > would it mean to have features that are highly correlated? I mean, in
> > > > this case our features are basically words, do you have an example of
> > > > how these features can not be independent? This concept is not really
> > > > clear in my mind...
> >
>
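
As a small illustration of the correlated-features question in the quoted
thread, the sketch below compares how often two words appear together with
what independence would predict, i.e. P(A and B) against P(A) * P(B). The
documents and the helper name `cooccurrence_lift` are invented for the
example; a ratio well above 1 suggests the two words are not independent.

```python
def cooccurrence_lift(docs, word_a, word_b):
    """Return P(a and b) / (P(a) * P(b)) over a list of documents (sets of words).

    A value near 1 is what independence predicts; values well above 1
    mean the words show up together more often than chance would suggest.
    """
    n = len(docs)
    has_a = sum(1 for d in docs if word_a in d)
    has_b = sum(1 for d in docs if word_b in d)
    if has_a == 0 or has_b == 0:
        raise ValueError("both words must occur in at least one document")
    has_both = sum(1 for d in docs if word_a in d and word_b in d)
    p_a, p_b, p_ab = has_a / n, has_b / n, has_both / n
    return p_ab / (p_a * p_b)


# Toy corpus: each document is the set of words it contains.
docs = [
    {"new", "york", "city"},
    {"new", "york", "times"},
    {"new", "car"},
    {"old", "car"},
]

lift_new_york = cooccurrence_lift(docs, "new", "york")
lift_new_car = cooccurrence_lift(docs, "new", "car")
print("new/york:", lift_new_york)
print("new/car:", lift_new_car)
```

Here "york" never appears without "new", so the pair co-occurs more often
than independent words would, which is the kind of dependence naive Bayes
ignores when it multiplies per-word probabilities.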



-- 
Fernando Santos
+55 61 8129 8505
