mahout-user mailing list archives

From Andrew Butkus <and...@butkus.co.uk>
Subject Mahout Naive Bayes Classification top words
Date Tue, 15 Oct 2013 14:19:55 GMT
Hi, I was wondering if you could help.

I've set up Mahout to classify news articles, so I can extract only
those news articles which are of interest.

I've gone through and manually labelled the titles of these news articles for training, approximately
80,000 in total (both articles I want and articles I don't want).

I have written an app which outputs the top words and their scores, and certain keywords
seem to be creeping high up the list of top words.

Some of the so-called top words are false positives: they are only at the top because every title
page has them, such as 'stratford herald' (which is the name of the newspaper). Is there any way to remove
them once a model has already been created?

There are about 20 top words which I would like to simply get rid of (or get Mahout to ignore
when providing best labels), but I don't want this to be an exercise on input (i.e. filtering
the names I'd like to exclude from the training input). I'd prefer to remove them afterwards, as I've already
spent a lot of time manually training.
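
As a rough illustration of the kind of post-hoc removal described above, here is a minimal Python sketch that does not touch the trained model or call any Mahout API. It assumes the classifier only scores terms that are present in the input, so stripping the unwanted words from a title before classification stops them from influencing the label. The names `IGNORED`, `filter_top_words` and `filter_tokens` are hypothetical, and the ignore list holds only the two words named in this message; the rest of the roughly 20 words would need to be filled in.

```python
# Hypothetical ignore list: only 'stratford' and 'herald' are named in the
# message; the remaining unwanted words are not listed there.
IGNORED = {"stratford", "herald"}


def filter_top_words(scores, ignored=IGNORED):
    """Drop ignored words from a {word: score} mapping, highest score first."""
    kept = {w: s for w, s in scores.items() if w.lower() not in ignored}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)


def filter_tokens(title, ignored=IGNORED):
    """Remove ignored words from a title before it is handed to the classifier."""
    return [t for t in title.lower().split() if t not in ignored]


# Scores copied from the 'negative_article' top-word output in this message.
negative_article = {
    "stratford": 10748.598348617554,
    "herald": 7579.555884361267,
    "avon": 7484.692479610443,
    "upon": 7476.3635239601135,
    "local": 7426.4039397239685,
}

top = filter_top_words(negative_article)
tokens = filter_tokens("Stratford Herald man jobs")
```

This leaves the stored model weights unchanged. It only hides the words from the report and keeps them out of the titles being classified, so it might not be equivalent to retraining without those words.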


Top words
- home: 1067
- dorset: 1493
- details: 908
- back: 867
- poole: 1651
- set: 819
- help: 743
- get: 812
- bournemouth: 14728
- new: 2661
- avon: 2684
- local: 3092
- cherries: 1244
- police: 1011
- over: 1813
- echo: 6526
- null: 79983
- after: 2292
- stratford: 2657
- school: 1395
- jobs: 881
- job: 6982
- car: 772
- herald: 2817
- nurse: 1174
- man: 1335
- manager: 1071
- day: 759
- time: 764
- council: 824
- upon: 2676
Number of labels: 2
Number of documents in training set: 79983
Top 75 words for label negative_article
- stratford: 10748.598348617554
- herald: 7579.555884361267
- avon: 7484.692479610443
- upon: 7476.3635239601135
- local: 7426.4039397239685
- after: 3837.6605548858643
- man: 3512.4373264312744
- police: 2586.899124145508
- over: 1537.557123184204
- woman: 1434.1630334854126
Top 75 words for label other
- bournemouth: 39076.86379265785
- job: 24028.39960718155
- echo: 22974.801107406616
- new: 10888.526140213013
- stratford: 8045.635549545288
- poole: 7493.278381347656
- over: 7077.8266887664795
- school: 7011.863867282867
- local: 7004.647378444672
- dorset: 6961.040742397308