mahout-user mailing list archives

From Baoqiang Cao <bqcaom...@gmail.com>
Subject the -md option and the word list
Date Mon, 02 Apr 2012 23:06:31 GMT
Hi,

I'm using unigrams when parsing text files. In the seq2sparse step, the
"-md" option gave results that differ from my interpretation of its
meaning, and I'd like help clarifying it.

This is what I did: seq2sparse -> kmeans -> clusterdump. After the
cluster dump, I collected the VL- vectors and extracted all the words
that show up in them. I tried "-md 3", "-md 200", and "-md 2000". My
interpretation is that "-md 2000" means a word is kept as a feature
only if it occurs in more than 2000 documents in my data, so it should
leave a smaller set of features/words than "-md 3" or "-md 200". But I
got the complete opposite: with "-md 2000" I got the largest number of
words in the VL- vectors.
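
To make the behaviour I expected concrete, here is a minimal sketch in
plain Python. It is not Mahout code: the function names and the toy
documents are made up for illustration, and it assumes the threshold
means "at least N documents" (whether the real cutoff is inclusive is
an assumption on my part). Under this reading, a larger threshold can
only shrink the word list.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # Use a set so a word repeated within one document counts once.
        df.update(set(doc.split()))
    return df


def prune_by_min_df(docs, min_df):
    """Keep only words that appear in at least min_df documents.

    Hypothetical helper illustrating the expected effect of a minimum
    document frequency threshold; not how Mahout implements it.
    """
    df = document_frequency(docs)
    return {word for word, count in df.items() if count >= min_df}


# Toy corpus: "apple" is in 3 documents, "banana" in 2, "cherry" in 1.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple",
]

for min_df in (1, 2, 3):
    vocab = prune_by_min_df(docs, min_df)
    print(min_df, sorted(vocab))
```

With this toy data the vocabulary goes from three words to two to one
as the threshold rises, which is the opposite of what I see in the
VL- vectors.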

Is my interpretation wrong? Any help? Thanks.

Best,
Baoqiang
