mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: TFIDFPartialVectorReducer minDf
Date Sat, 22 Sep 2012 11:15:23 GMT

On Sep 20, 2012, at 1:55 PM, Dave Byrne wrote:

> In TFIDFPartialVectorReducer.java:
> 
> If docFreq > maxDocFreq then the vector at that index is not set (ignored)
> If docFreq < minDocFreq then the vector at that index is set to the TfIdf calculation
> using minDocFreq instead of the actual document frequency.
> 
> Should minDocFreq not be treated the same as maxDocFreq by skipping setting the vector
> at that index?

I think the idea is that it is being rounded up to provide some minimum level of input.  It's
always a bit of a hedge with these rare terms: sometimes they are just garbage, other times
they are valuable.  My leaning would be towards keeping it as is.
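
To make the two cases concrete, here is a minimal sketch of the behavior described above. It is
not the actual TFIDFPartialVectorReducer code: the class name MinDfSketch, the method names, the
map-based "vector", and the tf * ln(numDocs / df) weighting are all assumptions made for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MinDfSketch {

    // Placeholder weighting (an assumption, not Mahout's formula): tf * ln(numDocs / df).
    public static double tfIdf(double tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Returns index -> weight for one document, applying the two thresholds
    // the way the thread describes them.
    public static Map<Integer, Double> weigh(Map<Integer, Double> termFreqs,
                                             Map<Integer, Long> docFreqs,
                                             long numDocs, long minDf, long maxDf) {
        Map<Integer, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> e : termFreqs.entrySet()) {
            long df = docFreqs.get(e.getKey());
            if (df > maxDf) {
                continue;      // too common: the index is left unset
            }
            if (df < minDf) {
                df = minDf;    // too rare: df is rounded up, the term is kept
            }
            out.put(e.getKey(), tfIdf(e.getValue(), df, numDocs));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Double> tf = new LinkedHashMap<>();
        tf.put(0, 1.0); // rare term
        tf.put(1, 1.0); // ordinary term
        tf.put(2, 1.0); // very common term
        Map<Integer, Long> df = new LinkedHashMap<>();
        df.put(0, 1L);
        df.put(1, 10L);
        df.put(2, 95L);
        System.out.println(weigh(tf, df, 100, 5, 90));
    }
}
```

With minDf = 5 and maxDf = 90, index 2 is dropped, while index 0 stays in the output and is
weighted as if it appeared in 5 documents rather than 1.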

> 
> In both cases, the vector length remains the same and these settings have no effect on
> pruning the vector length / term reduction?

--------------------------------------------
Grant Ingersoll
http://www.lucidworks.com




