mahout-user mailing list archives

From Dave Byrne <dby...@mdb.com>
Subject Re: seq2sparse seems to be ignoring the value of my “-x” parameter
Date Tue, 25 Sep 2012 19:54:38 GMT
maxDFPercent (the -x option) won't actually remove terms from the dictionary or reduce the
size of the tfidf vectors. It simply sets the weight to 0 for any term over the threshold.

In other words, the dictionary size and vector length will remain the same, but the vectors
will have fewer non-zero entries.
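
To make the distinction concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the function names, the toy documents, and the tf * (log(N/df) + 1) weighting are all assumptions chosen for illustration. It only shows the behavior described above, where a term over the max-DF threshold keeps its dictionary slot and just gets weight 0.

```python
import math

def tfidf_with_max_df(docs, max_df_percent):
    """Toy tf-idf. Terms whose document frequency exceeds the threshold
    keep their dictionary slot but are left at weight 0 (illustrative only)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    max_df = n_docs * max_df_percent / 100.0

    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)  # length never depends on the threshold
        for term in set(doc):
            if df[term] > max_df:
                continue  # over the threshold: entry stays 0
            tf = doc.count(term)
            vec[index[term]] = tf * (math.log(n_docs / df[term]) + 1.0)
        vectors.append(vec)
    return vocab, vectors

def count_nonzero(vectors):
    """Total number of non-zero entries across all vectors."""
    return sum(1 for vec in vectors for w in vec if w != 0.0)

# Hypothetical four-document corpus; "the" appears in every document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

for pct in (100, 50, 25):
    vocab, vectors = tfidf_with_max_df(docs, pct)
    print(pct, len(vocab), len(vectors[0]), count_nonzero(vectors))
```

In this toy corpus the dictionary size and vector length stay at 7 for every threshold, while the number of non-zero entries drops as the percentage decreases. That matches the symptom in the quoted message: counting dictionary terms will not show any effect of -x, but counting non-zero entries in the vectors should.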


On Sep 25, 2012, at 10:25 AM, Matt Molek <mpmolek@gmail.com> wrote:

> I'm using mahout 0.7 on a pseudo-distributed hadoop installation for
> testing purposes.
>
> A lot of what I'm doing is being guided by Mahout in Action, which I
> know deals with 0.5, but as far as I can tell, nothing major has
> changed with seq2sparse.
>
> I'm having a problem with the tfidf vectors generated by seq2sparse.
> No matter what I set "-x" (max document frequency percentage) to, I
> end up with the same number of terms in my dictionary, and vectors of
> the same size. Shouldn't I be getting smaller tfidf vectors as my -x
> value decreases?
>
> I found one posting about mahout 0.6 where -x was being parsed as an
> absolute number of documents rather than a percentage of documents.
> That was supposed to have been fixed in 0.7, but I tried using it in
> that way too just to see if it would help. No change in the number of
> terms I'm getting. Here are the values I've tried, and the number of
> terms I've ended up with. My data set is 4850 Wikipedia articles from:
> http://dumps.wikimedia.org/enwiki/20110803/
>
> The exact file is: pages-articles1.xml.bz2
>
> The xml file was turned into a seqfile with:
>
> mahout seqwiki -all -i <path to xml file> -o <path to output directory>
>
> My calls to seq2sparse look like this:
>
> mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv
>
> My results:
>
> | -x value | # of terms |
> | 4800     | 256623     |
> | 4600     | 256623     |
> | 2500     | 256623     |
> | 99       | 256623     |
> | 90       | 256623     |
> | 25       | 256623     |
> | 5        | 256623     |
>
> Any ideas on what I'm doing wrong? Thanks for the help.

