mahout-user mailing list archives

From Baoqiang Cao <bqcaom...@gmail.com>
Subject Re: tokenizer for text
Date Fri, 18 May 2012 14:56:34 GMT
In addition, you could try increasing the word-occurrence thresholds
via the -s and -md options.
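
For example, something along these lines (the paths and values are
only placeholders; if I remember the option meanings right, -s is the
minimum term support and -md the minimum document frequency):

  bin/mahout seq2sparse \
    -i text-seqfiles \
    -o text-vectors \
    -wt tfidf \
    -s 5 \
    -md 3

Raising -s and -md drops the rarest terms, which usually cuts the
vector dimensionality considerably.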

On Fri, May 18, 2012 at 9:41 AM, John Conwell <john@iamjohn.me> wrote:
> What do you have in mind as far as a different tokenizer?  Are you doing
> stopword filtering?  Maybe look at the stopword list and see if there are
> other noise words you wish to add.  If you are using Lucene to filter
> stopwords, its stopword list is pretty small (20 or so words).  Stemming is
> another method often used to reduce your feature space.  You could look
> at lemmatization instead of stemming.  It won't reduce the feature space as
> much, but it can help normalize different terms that share the same lemma.
>
> You can put together your own Lucene analyzer with whatever Lucene filter
> pipeline you want and plug it into SparseVectorsFromSequenceFiles in order
> to replace the stock tokenizer.
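>
> A rough sketch of such an analyzer, assuming the Lucene 3.x API that
> ships with recent Mahout releases (the class name and filter chain are
> only an example; adjust the Version constant to your Lucene jar):
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.LowerCaseFilter;
>   import org.apache.lucene.analysis.PorterStemFilter;
>   import org.apache.lucene.analysis.StopAnalyzer;
>   import org.apache.lucene.analysis.StopFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.standard.StandardTokenizer;
>   import org.apache.lucene.util.Version;
>
>   // Tokenize, lowercase, drop English stopwords, then stem.
>   public class MyTextAnalyzer extends Analyzer {
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream stream = new StandardTokenizer(Version.LUCENE_36, reader);
>       stream = new LowerCaseFilter(Version.LUCENE_36, stream);
>       stream = new StopFilter(Version.LUCENE_36, stream,
>           StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>       return new PorterStemFilter(stream);
>     }
>   }
>
> Pass the fully qualified class name to seq2sparse with -a
> (--analyzerName) and make sure the class is on the job's classpath.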
>
>
>
> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <l.allen09@gmail.com> wrote:
>
>> Hi List,
>>
>> I am trying to use Mahout to do clustering on text. The problem is that
>> after running SparseVectorsFromSequenceFiles, the dimension of the
>> tf-idf vectors is too high (about 50K), and it grows as the number
>> of documents increases. I think there are two ways to handle that. One
>> is to use dimension reduction. The other is to use a better
>> tokenizer, which seems the better option.
>>
>> My questions are
>>
>> 1) How can I change the default tokenizer, or where can I find a new one?
>> 2) Is there a third option to deal with the number of dimensions?
>>
>> Thanks a lot.
>>
>> --
>> Regards,
>> Jiaan
>>
>
>
>
> --
>
> Thanks,
> John C
