mahout-user mailing list archives

From Andrew Butkus <and...@butkus.co.uk>
Subject RE: Mahout Naive Bayes Classification top words
Date Wed, 16 Oct 2013 09:00:40 GMT
I thought about this, but I've already spent a few days manually
training on 80,000 titles, so I wanted to avoid doing that again,
hence the post-hoc analysis.

I've had a few suggestions that altering the model after creation is a
bad idea, as it breaks the underlying mechanics, so it looks like I
will have to go through and train again :(

Sent from my Windows Phone

From: Suneel Marthi
Sent: 15/10/2013 16:25
To: user@mahout.apache.org; Andrew Butkus
Subject: Re: Mahout Naive Bayes Classification top words
You could run your articles' titles through a stop-word filter that
drops the terms you would like not to see before training.
As an example, Lucene comes with a default stop-word filter
(StopFilter); you could create something similar for your scenario and
run your text through this filter (for both training and test).
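
The filtering step described above can be sketched in plain Java. This
is only an illustration under stated assumptions, not Lucene's or
Mahout's API: the class name TitleStopWordFilter, the stop list, and
the example title are all hypothetical, and the tokenisation (lower-case,
split on non-letters) is a simplification of what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Mahout or Lucene.
final class TitleStopWordFilter {
    // Illustrative stop list (an assumption): masthead words from this thread.
    static final Set<String> STOP_WORDS = Set.of("stratford", "herald", "null");

    // Lower-cases the title, splits on non-letter characters, drops stop words.
    static List<String> filter(String title) {
        List<String> kept = new ArrayList<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up title; prints [man, arrested, after, police, chase]
        System.out.println(filter("Man arrested after police chase - Stratford Herald"));
    }
}
```

The same filter would need to be applied to the training titles and to
every title classified later, so that both sides see the same
vocabulary.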






On Tuesday, October 15, 2013 10:20 AM, Andrew Butkus
<andrew@butkus.co.uk> wrote:

Hi, I was wondering if you could help.

I've set up Mahout to classify news articles, so that I can extract
only the articles that are of interest.

I've gone through and manually labelled the titles of these news
articles for training, approximately 80,000 in total (both articles I
want and articles I don't want).

I have written an app which outputs the top words and their scores,
and it seems certain keywords are creeping high up the list.

Some of the so-called top words are false positives: they rank highly
only because every title page contains them, such as 'stratford
herald' (which is the name of the newspaper). Is there any way to
remove them once a model has already been created?
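
One way to confirm which words appear on nearly every title page is to
count document frequency, i.e. the share of titles containing each
word. The following is a rough sketch only: the class name
DocumentFrequency, the 0.9 threshold, and the three titles are made up
for illustration and do not come from the real data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Mahout.
final class DocumentFrequency {
    // Returns, alphabetically, the words found in more than maxFraction of the titles.
    static List<String> wordsAbove(List<String> titles, double maxFraction) {
        Map<String, Integer> docCounts = new TreeMap<>();
        for (String title : titles) {
            // A set, so a word repeated within one title is counted once.
            Set<String> words = new HashSet<>(
                    Arrays.asList(title.toLowerCase(Locale.ROOT).split("[^a-z]+")));
            words.remove(""); // split() can yield an empty leading token
            for (String word : words) {
                docCounts.merge(word, 1, Integer::sum);
            }
        }
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (e.getValue() > maxFraction * titles.size()) {
                frequent.add(e.getKey());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "Man jailed after robbery - Stratford Herald",
                "School wins award - Stratford Herald",
                "Council approves new budget - Stratford Herald");
        System.out.println(wordsAbove(titles, 0.9)); // prints [herald, stratford]
    }
}
```

Words flagged this way are candidates for a stop list, since a word
present in almost every title carries little information about the
label.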

There are about 20 top words which I would like to simply get rid of
(or get Mahout to ignore when providing best labels), but I don't want
this to be an exercise on the input (i.e. filtering the names I'd like
to exclude from the training input). I'd prefer to remove them after
the fact, as I've already spent a lot of time manually training.
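
If the goal is only a cleaner top-words listing, the exclusion can be
done in the reporting app without touching the model. The sketch below
assumes the app already holds word scores in a map; TopWordsReport and
its method are hypothetical names, and the scores are rounded from the
negative_article list later in this message. It changes the report
only, not which label the classifier picks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical reporting helper; it does not read or modify a Mahout model.
final class TopWordsReport {
    // Returns up to `limit` words by descending score, skipping excluded words.
    static List<String> topWords(Map<String, Double> scores, Set<String> excluded, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.removeIf(e -> excluded.contains(e.getKey()));
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(limit, entries.size()))) {
            words.add(e.getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("stratford", 10748.60);
        scores.put("herald", 7579.56);
        scores.put("avon", 7484.69);
        scores.put("upon", 7476.36);
        scores.put("local", 7426.40);
        // prints [avon, upon, local]
        System.out.println(topWords(scores, Set.of("stratford", "herald"), 3));
    }
}
```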


Top words
- home: 1067
- dorset: 1493
- details: 908
- back: 867
- poole: 1651
- set: 819
- help: 743
- get: 812
- bournemouth: 14728
- new: 2661
- avon: 2684
- local: 3092
- cherries: 1244
- police: 1011
- over: 1813
- echo: 6526
- null: 79983
- after: 2292
- stratford: 2657
- school: 1395
- jobs: 881
- job: 6982
- car: 772
- herald: 2817
- nurse: 1174
- man: 1335
- manager: 1071
- day: 759
- time: 764
- council: 824
- upon: 2676
Number of labels: 2
Number of documents in training set: 79983
Top 75 words for label negative_article
- stratford: 10748.598348617554
- herald: 7579.555884361267
- avon: 7484.692479610443
- upon: 7476.3635239601135
- local: 7426.4039397239685
- after: 3837.6605548858643
- man: 3512.4373264312744
- police: 2586.899124145508
- over: 1537.557123184204
- woman: 1434.1630334854126
Top 75 words for label other
- bournemouth: 39076.86379265785
- job: 24028.39960718155
- echo: 22974.801107406616
- new: 10888.526140213013
- stratford: 8045.635549545288
- poole: 7493.278381347656
- over: 7077.8266887664795
- school: 7011.863867282867
- local: 7004.647378444672
- dorset: 6961.040742397308
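
As a quick check on the two lists above, words that rank highly under
both labels are unlikely to help separate them. A minimal sketch that
finds the overlap follows; SharedTopWords is a hypothetical name, and
the word lists are the ten entries shown for each label in this
message, not the full top 75.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; it works on word lists, not on a Mahout model.
final class SharedTopWords {
    // Returns, alphabetically, the words that appear in both collections.
    static Set<String> shared(Collection<String> first, Collection<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.retainAll(second);
        return result;
    }

    public static void main(String[] args) {
        Set<String> negativeArticle = Set.of("stratford", "herald", "avon", "upon",
                "local", "after", "man", "police", "over", "woman");
        Set<String> other = Set.of("bournemouth", "job", "echo", "new", "stratford",
                "poole", "over", "school", "local", "dorset");
        System.out.println(shared(negativeArticle, other)); // prints [local, over, stratford]
    }
}
```

The overlap only suggests candidates to review; the raw scores are not
compared here, because the two labels may cover different numbers of
documents.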
