mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: stop word-lists VS maxDFPercent
Date Mon, 12 May 2014 20:50:38 GMT

I usually recommend building a custom stop list from your own corpus.  That tends to work
much better than a general-purpose list.

I also like keeping the doc frequency limit (maxDFPercent) in place as a safety net, in case something unexpected slips past the stop list.
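
To make the two ideas concrete, here is a rough sketch in plain Python (not Mahout code). It assumes the corpus is a list of strings and uses a naive regex tokenizer; the helper names and the `top_n` parameter are made up for illustration, and the cutoff only loosely mirrors what maxDFPercent does. The list it produces is a set of candidates that you would still want to review by hand.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents it appears in."""
    df = Counter()
    for text in docs:
        # set() so a term is counted once per document, not once per occurrence
        df.update(set(re.findall(r"[a-z']+", text.lower())))
    return df


def corpus_stop_list(docs, top_n=50):
    """Candidate stop words: the top_n terms by document frequency."""
    return [term for term, _ in document_frequencies(docs).most_common(top_n)]


def filter_terms(docs, stop_words, max_df_percent=30):
    """Drop stop words, then drop terms found in more than max_df_percent of docs."""
    df = document_frequencies(docs)
    limit = max_df_percent / 100.0 * len(docs)
    stop = set(stop_words)
    return {term for term, n in df.items() if term not in stop and n <= limit}


if __name__ == "__main__":
    docs = ["the cat sat", "the dog ran", "the bird flew", "a cat ran"]
    print(corpus_stop_list(docs, top_n=1))
    print(sorted(filter_terms(docs, stop_words=["a"], max_df_percent=50)))
```

The stop list removes terms you already know are noise, while the frequency cutoff catches anything common that the list missed.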

Sent from my iPhone

> On May 12, 2014, at 6:24, David Noel <david.i.noel@gmail.com> wrote:
> 
> What's everyone's opinion on using large stop word lists vs a very
> small value for maxDFPercent (like 30)? I'm playing around with both
> and am having trouble deciding whether one is better than the other,
> or if I should use a combination of both. My data set is one day's
> worth of news articles gathered from 1000 online news outlets. It's
> probably similar to the Reuters data set, but with a little more
> noise. I used Boilerpipe for article extraction.
> 
> I spent a good while Googling around to build the largest (English)
> stop word-list I could. I'll paste it below for anyone who's
> interested and would like to save themselves an hour of Googling and
> collating.
