mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: seq2sparse seems to be ignoring the value of my “-x” parameter
Date Wed, 26 Sep 2012 07:49:23 GMT
Hi-

These batch jobs can be run inside Eclipse and single-stepped in the debugger. Sometimes
that is the only way to see how a parameter value is actually used.
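
As a reference point while stepping through, here is a minimal Python sketch of what a maximum-document-frequency percentage cutoff is generally expected to do. This is not Mahout's code: the function names and the toy corpus are made up for illustration, and keeping terms that sit exactly on the boundary (<= rather than <) is an assumption.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a term counts once per document
    return df


def prune_by_max_df_percent(df, num_docs, max_df_percent):
    """Keep terms that appear in at most max_df_percent of the documents.

    Assumption: terms exactly at the cutoff are kept.
    """
    max_df = num_docs * max_df_percent / 100.0
    return {term for term, count in df.items() if count <= max_df}


# Hypothetical toy corpus, four tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]

df = document_frequencies(docs)
for pct in (100, 50, 25):
    kept = prune_by_max_df_percent(df, len(docs), pct)
    print(pct, len(kept))
```

With a cutoff like this, the dictionary shrinks as the percentage drops, so a term count that never changes suggests the value is not reaching the pruning step at all.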

----- Original Message -----
| From: "Matt Molek" <mpmolek@gmail.com>
| To: user@mahout.apache.org
| Sent: Tuesday, September 25, 2012 10:25:09 AM
| Subject: seq2sparse seems to be ignoring the value of my “-x” parameter
| 
| I'm using Mahout 0.7 on a pseudo-distributed Hadoop installation for
| testing purposes.
| 
| A lot of what I'm doing is being guided by Mahout in Action, which I
| know deals with 0.5, but as far as I can tell, nothing major has
| changed with seq2sparse.
| 
| I'm having a problem with the tfidf vectors generated by seq2sparse.
| No matter what I set "-x" (max document frequency percentage) to, I
| end up with the same number of terms in my dictionary, and vectors of
| the same size. Shouldn't I be getting smaller tfidf vectors as my -x
| value decreases?
| 
| I found one posting about Mahout 0.6 where -x was being parsed as an
| absolute number of documents rather than a percentage of documents.
| That was supposed to have been fixed in 0.7, but I tried using it in
| that way too just to see if it would help. No change in the number of
| terms I'm getting. The values I've tried, and the number of terms I
| ended up with, are listed below. My data set is 4850 Wikipedia
| articles from:
| http://dumps.wikimedia.org/enwiki/20110803/
| 
| The exact file is: pages-articles1.xml.bz2
| 
| The xml file was turned into a seqfile with:
| 
| mahout seqwiki -all -i <path to xml file> -o <path to output directory>
| 
| My calls to seq2sparse look like this:
| 
| mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv
| 
| My results:
| 
| | -x value | # of terms |
| | 4800     | 256623     |
| | 4600     | 256623     |
| | 2500     | 256623     |
| | 99       | 256623     |
| | 90       | 256623     |
| | 25       | 256623     |
| | 5        | 256623     |
| 
| Any ideas on what I'm doing wrong? Thanks for the help.
| 
