mahout-user mailing list archives

From Abramov Pavel <p.abra...@rambler-co.ru>
Subject Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.
Date Thu, 02 Aug 2012 12:44:28 GMT
Thanks for this idea.

Looks like a bug:
1) Setting --maxDFPercent to 100 has no effect.
2) Setting --maxDFPercent to 1 000 000 000 makes the TFIDF vectors OK.

It looks like seq2sparse cuts terms with DF > maxDFPercent, so maxDFPercent
is not a percentage: it is treated as an absolute document count.
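
To make the difference between the two interpretations concrete, here is a
minimal Python sketch. It is not Mahout's code: the function names and the
toy corpus are made up for illustration, and it only assumes that pruning
keeps a term when its DF is at or below some limit.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): a term counts once per document
    return df


def prune_by_percent(df, num_docs, max_df_percent):
    """Keep terms whose DF is at most max_df_percent % of all documents."""
    limit = num_docs * max_df_percent / 100.0
    return {term for term, n in df.items() if n <= limit}


def prune_by_absolute(df, max_df):
    """Keep terms whose DF is at most max_df documents (absolute count)."""
    return {term for term, n in df.items() if n <= max_df}


# Hypothetical corpus of 1000 documents:
#   "common" is in every document  (DF = 1000)
#   "mid"    is in every 5th one   (DF = 200)
#   "rare"   is in every 100th one (DF = 10)
docs = []
for i in range(1000):
    doc = ["common"]
    if i % 5 == 0:
        doc.append("mid")
    if i % 100 == 0:
        doc.append("rare")
    docs.append(doc)

df = document_frequencies(docs)

# Read as a percentage, 90 only drops terms found in more than 900 documents.
print(sorted(prune_by_percent(df, len(docs), 90)))   # ['mid', 'rare']

# Read as an absolute count, 90 drops every term found in more than 90 documents.
print(sorted(prune_by_absolute(df, 90)))             # ['rare']

# A huge absolute limit keeps everything, matching observation 2) above.
print(sorted(prune_by_absolute(df, 1_000_000_000)))  # ['common', 'mid', 'rare']
```

Under the absolute reading, raising the value from 90 to 100 changes almost
nothing on a large corpus, which would match observation 1) above.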


Pavel




On 01.08.12 20:46, "Robin Anil" <robin.anil@gmail.com> wrote:

>The TFIDF job is where the document frequency pruning is applied. Try
>increasing maxDFPercent to 100%.
>
>On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
><p.abramov@rambler-co.ru> wrote:
>
>> Hello!
>>
>> I have trouble running the "seq2sparse" example with TFIDF weights. My TF
>> vectors are OK, while the TFIDF vectors are 10 times smaller. It looks like
>> seq2sparse cuts my terms during the TFxIDF step: Document1 has 20 terms in
>> the TF vector but only 2 terms in the TFIDF vector. What is wrong? I spent
>> 2 days looking for the answer and configuring seq2sparse parameters ((
>>
>> Thanks in advance!
>>
>> mahout seq2sparse -ow  \
>> -chunk 512 \
>> --maxDFPercent 90 \
>> --maxNGramSize 1 \
>> --numReducers 128 \
>> --minSupport 150 \
>> -i --- \
>> -o --- \
>> -wt tfidf \
>> --namedVector \
>> -a org.apache.lucene.analysis.WhitespaceAnalyzer
>>
>> Pavel
>>
>>

