mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.
Date Fri, 03 Aug 2012 20:55:03 GMT
This sounds a lot like a bug that was fixed by a patch some time ago. Grant,
I think it was something I had wanted you to double-check; I'm not sure if you
had a look. But I think it was fixed, if it's the same issue.

On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:

> Thanks for this idea.
>
> Looks like a bug:
> 1) Setting --maxDFPercent to 100 has no effect
> 2) Setting --maxDFPercent to 1 000 000 000 makes TFIDF vectors Ok.
>
> seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a
> percentage; it is treated as an absolute value.
>
>
> Pavel
>
>
>
>
> On 01.08.12 20:46, "Robin Anil" <robin.anil@gmail.com> wrote:
>
> >The TFIDF job is where the document frequency pruning is applied. Try
> >increasing maxDFPercent to 100%.
> >
> >On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
> ><p.abramov@rambler-co.ru> wrote:
> >
> >> Hello!
> >>
> >> I have trouble running the "seq2sparse" example with TFIDF weights. My TF
> >> vectors are OK, while the TFIDF vectors are 10 times smaller. It looks like
> >> seq2sparse cuts my terms during the TFxIDF step. Document1 in the TF vector
> >> has 20 terms, while Document1 in the TFIDF vector has only 2 terms. What is
> >> wrong? I spent 2 days looking for the answer and configuring seq2sparse
> >> parameters ((
> >>
> >> Thanks in advance!
> >>
> >> mahout seq2sparse -ow  \
> >> -chunk 512 \
> >> --maxDFPercent 90 \
> >> --maxNGramSize 1 \
> >> --numReducers 128 \
> >> --minSupport 150 \
> >> -i --- \
> >> -o --- \
> >> -wt tfidf \
> >> --namedVector \
> >> -a org.apache.lucene.analysis.WhitespaceAnalyzer
> >>
> >> Pavel
> >>
> >>
>
>
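
To illustrate the behaviour Pavel describes, here is a minimal sketch of
document-frequency pruning with the threshold read as a percentage of the
corpus versus as an absolute document count. This is not Mahout's code: the
function names and the toy corpus are hypothetical, and the only assumption
taken from the thread is that terms with DF above the threshold are dropped.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    return df


def prune_by_percent(docs, max_df_percent):
    """Keep terms whose DF is at most max_df_percent percent of the corpus."""
    df = document_frequencies(docs)
    limit = max_df_percent / 100.0 * len(docs)
    return [[t for t in terms if df[t] <= limit] for terms in docs]


def prune_by_absolute(docs, max_df):
    """Keep terms whose DF is at most max_df documents.

    This mirrors the behaviour reported in the thread, where the value
    passed as a percentage appears to be compared directly against DF.
    """
    df = document_frequencies(docs)
    return [[t for t in terms if df[t] <= max_df] for terms in docs]


if __name__ == "__main__":
    # Hypothetical corpus: 200 documents sharing one common term.
    docs = [["common", f"rare{i}"] for i in range(200)]
    print(prune_by_percent(docs, 100)[0])
    print(prune_by_absolute(docs, 100)[0])
```

With a threshold of 100, the percentage reading keeps every term, while the
absolute reading drops any term that occurs in more than 100 documents. That
would be consistent with most terms disappearing from the TFIDF vectors on a
large corpus, and with a very large threshold restoring them.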
