mahout-user mailing list archives

From Yuval Feinstein <yuv...@citypath.com>
Subject Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.
Date Tue, 07 Aug 2012 07:08:07 GMT
This is the case:
https://issues.apache.org/jira/browse/MAHOUT-973
The bug exists in Mahout 0.6 and was fixed in Mahout 0.7.
I also worked around it by setting --maxDFPercent to a high value
(I guess a value equal to the number of documents in the corpus is enough).
Maybe it would be good to fix it in 0.6 as well?
Thanks,
Yuval
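
For readers of the archive, here is a minimal sketch of the behaviour discussed in this thread. It is not Mahout's code; the function names and the toy corpus are hypothetical and only illustrate the difference between treating the threshold as a raw document count (what Pavel observed) and as a percentage of the corpus (what the option name suggests).

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def prune_absolute(docs, max_df):
    """Reported behaviour: drop terms whose DF exceeds max_df as a raw count."""
    df = document_frequencies(docs)
    return [[t for t in doc if df[t] <= max_df] for doc in docs]


def prune_percent(docs, max_df_percent):
    """Expected behaviour: drop terms found in more than max_df_percent % of documents."""
    df = document_frequencies(docs)
    n = len(docs)
    return [[t for t in doc if 100.0 * df[t] / n <= max_df_percent] for doc in docs]


# Hypothetical toy corpus of 200 documents:
#   "common"  appears in all 200 documents (100 %)
#   "group0"/"group1" each appear in 100 documents (50 %)
#   "uniqueN" appears in exactly one document
docs = [["common", f"group{i % 2}", f"unique{i}"] for i in range(200)]

print(prune_absolute(docs, 90)[0])         # threshold read as a count
print(prune_percent(docs, 90)[0])          # threshold read as a percentage
print(prune_absolute(docs, len(docs))[0])  # workaround: threshold >= corpus size
```

With the threshold read as a count, every term that occurs in more than 90 documents disappears, which on a large corpus removes most of the vocabulary. A threshold at least as large as the corpus size prunes nothing, which is consistent with the workaround described above.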

On Fri, Aug 3, 2012 at 11:55 PM, Sean Owen <srowen@gmail.com> wrote:
> This sounds a lot like a bug that was fixed by a patch some time ago. Grant,
> I think it was something I had wanted you to double-check; not sure if you
> had a look. But I think it was fixed, if it's the same issue.
>
> On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:
>
>> Thanks for this idea.
>>
>> Looks like a bug:
>> 1) Setting --maxDFPercent to 100 has no effect
>> 2) Setting --maxDFPercent to 1 000 000 000 makes TFIDF vectors Ok.
>>
>> seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a
>> percentage; it is treated as an absolute value.
>>
>>
>> Pavel
>>
>>
>>
>>
>> On 01.08.12 20:46, "Robin Anil" <robin.anil@gmail.com> wrote:
>>
>> >The TFIDF job is where the document frequency pruning is applied. Try
>> >increasing maxDFPercent to 100%.
>> >
>> >On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:
>> >
>> >> Hello!
>> >>
>> >> I have trouble running the "seq2sparse" example with TFIDF weights. My TF
>> >> vectors are Ok, while the TFIDF vectors are 10 times smaller. It looks like
>> >> seq2sparse cuts my terms during the TFxIDF step. Document1 has 20 terms in
>> >> the TF vector, but only 2 terms in the TFIDF vector. What is wrong? I spent
>> >> 2 days looking for the answer and tuning seq2sparse parameters ((
>> >>
>> >> Thanks in advance!
>> >>
>> >> mahout seq2sparse -ow  \
>> >> -chunk 512 \
>> >> --maxDFPercent 90 \
>> >> --maxNGramSize 1 \
>> >> --numReducers 128 \
>> >> --minSupport 150 \
>> >> -i --- \
>> >> -o --- \
>> >> -wt tfidf \
>> >> --namedVector \
>> >> -a org.apache.lucene.analysis.WhitespaceAnalyzer
>> >>
>> >> Pavel
>> >>
>> >>
>>
>>
