mahout-user mailing list archives

From Baoqiang Cao <>
Subject Re: tokenizer for text
Date Fri, 18 May 2012 14:56:34 GMT
In addition, you could try increasing the word occurrence thresholds
with the -s and -md options.
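
To illustrate what raising such thresholds does to the vocabulary, here is a minimal sketch in plain Python. It is not Mahout's implementation; the function name prune_vocabulary and its parameters min_support and min_df are hypothetical, and the mapping to -s (minimum corpus-wide term count) and -md (minimum document frequency) is an assumption about what those options control.

```python
from collections import Counter

def prune_vocabulary(docs, min_support=2, min_df=1):
    """Return the sorted terms that pass both frequency thresholds.

    docs: list of token lists, one per document.
    min_support: minimum total occurrences across the corpus
                 (assumed analogue of -s).
    min_df: minimum number of documents containing the term
            (assumed analogue of -md).
    """
    total = Counter()     # corpus-wide term counts
    doc_freq = Counter()  # number of documents containing each term
    for tokens in docs:
        total.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t in total
                  if total[t] >= min_support and doc_freq[t] >= min_df)

# Toy corpus: rare terms such as "typo0" inflate the dimension count.
docs = [
    ["mahout", "cluster", "text", "text"],
    ["mahout", "vector", "tfidf"],
    ["cluster", "text", "typo0"],
]
loose = prune_vocabulary(docs, min_support=1, min_df=1)
strict = prune_vocabulary(docs, min_support=2, min_df=2)
print(len(loose), len(strict))
```

With the stricter thresholds, terms that occur only once or in a single document drop out, so the vector dimension grows more slowly as documents are added.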

On Fri, May 18, 2012 at 9:41 AM, John Conwell <> wrote:
> What do you have in mind as far as a different tokenizer?  Are you doing
> stopword filtering?  Maybe look at the stopword list and see if there are
> other noise words you wish to add.  If you are using Lucene to filter
> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
> another method often used to reduce your feature space.  You could look
> at lemmatization instead of stemming.  It won't reduce the feature space as
> much, but could help in normalizing different terms with the same lemma.
> You can put together your own Lucene analyzer with whatever Lucene filter
> pipeline you want and pass it to SparseVectorsFromSequenceFiles in order
> to replace the stock tokenizer.
> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <> wrote:
>> Hi List,
>> I am trying to use Mahout to cluster text. The problem is that after
>> running SparseVectorsFromSequenceFiles, the dimension of the
>> tf-idf vectors is too high (about 50K), and it increases as the number
>> of documents increases. I think there are two ways to handle that. One
>> is to use dimension reduction. The other is to use a better
>> tokenizer, which should be the better option.
>> My questions are:
>> 1) How can I change the default tokenizer, or where can I find a new one?
>> 2) Is there a third option for dealing with the number of dimensions?
>> Thanks a lot.
>> --
>> Regards,
>> Jiaan
> --
> Thanks,
> John C
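
The stopword-filtering idea in John's reply can be sketched as a small token pipeline. This is plain Python for illustration only, not a Lucene analyzer; the STOPWORDS set, the tokenize function, and its extra_stopwords and min_length parameters are all hypothetical, and the stopword list is a made-up sample rather than Lucene's actual list.

```python
import re

# Small illustrative stopword list (an assumption, not Lucene's list).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

def tokenize(text, extra_stopwords=(), min_length=2):
    """Lowercase, split on non-letters, then drop short and noise tokens."""
    stop = STOPWORDS | set(extra_stopwords)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stop]

# Domain-specific noise words are added on top of the base list.
tokens = tokenize(
    "I am trying to use Mahout to cluster the text",
    extra_stopwords={"am", "trying"},
)
print(tokens)
```

Each filter stage removes tokens before they reach the dictionary, so extending the noise-word list is a cheap way to shrink the feature space; a stemming or lemmatization stage could be added as a further step in the same chain.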
