mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Naive Bayes Classifier as a Recommender
Date Sun, 20 Oct 2013 17:06:04 GMT
To use a classifier you need to have the training data labeled, so I assume you have a bunch
of these small docs labeled with the 30,000 categories? See BTW-2 for things to consider in
collecting those labels.

One rather simple way to suggest a category would be to throw every small doc for a category
into one collection and treat it as a single document, one per category. Then index the
category docs using Solr, which will automatically apply TF-IDF weighting to the terms. You
can then use a new small doc as a more-like-this query against Solr, which will return a list
of the most similar categories ordered by similarity score.
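As a rough sketch of the query side (the core name "categories" and field name "text" are illustrative assumptions, not from this thread): assuming each Solr document concatenates all the small docs labeled with one category, a more-like-this request could be built like this:

```python
# Sketch: build a Solr MoreLikeThis query URL for a new small doc.
# Assumes a Solr core "categories" with one document per category and a
# "text" field holding the concatenated small docs -- both hypothetical.
from urllib.parse import urlencode

def mlt_query_url(base_url, doc_text, rows=5):
    params = {
        "q": doc_text,       # the new small doc is the query
        "mlt": "true",       # enable the MoreLikeThis component
        "mlt.fl": "text",    # field to mine "interesting terms" from
        "mlt.mintf": 1,      # min term frequency in the query doc
        "mlt.mindf": 1,      # min doc frequency across the index
        "rows": rows,        # top-N most similar category docs
        "fl": "id,score",    # return just the category id and score
    }
    return base_url + "/select?" + urlencode(params)

url = mlt_query_url("http://localhost:8983/solr/categories", "new tweet text")
```

The returned category ids, ordered by score, are the suggestion list.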

This is very simple to create from labeled data. Furthermore, you can hold out some of the
labeled small docs and apply precision measurements to the Solr results to see how well it
is doing. You can also apply the same precision measurement to your classifier method. Since
you have ground truth from the labeled data, you can compare the precision of the two methods
using exactly the same training and hold-out sets.
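The hold-out comparison above can be sketched as a simple precision@k measurement, assuming each method returns a ranked list of suggested categories for every held-out doc:

```python
# Sketch: precision@k over a shared hold-out set, so both methods are
# scored on exactly the same docs. "results" pairs each held-out doc's
# ranked suggestions with its true label.
def precision_at_k(suggestions, true_label, k=5):
    # 1.0 if the true category appears in the top-k suggestions, else 0.0
    return 1.0 if true_label in suggestions[:k] else 0.0

def mean_precision_at_k(results, k=5):
    # results: list of (ranked_suggestions, true_label) pairs
    hits = [precision_at_k(s, t, k) for s, t in results]
    return sum(hits) / len(hits)
```

Run this once with the Solr more-like-this suggestions and once with the classifier's ranked output, using identical training and hold-out splits.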

BTW in the above scenario you don't have to use training data that has only one label per
small doc. With multiple labels per doc you would be suggesting tags rather than mutually
exclusive categories.

BTW-2: Be careful about precision measurements of a system in use. If you are suggesting
categories for a small doc, the suggestions will affect which category the user chooses, so
the process is somewhat self-fulfilling. If you then use these user choices as your base-truth
labels, they will likely carry a strong bias. There are various ways around this; just be
aware that it's happening.
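One common workaround, sketched here as an assumption rather than something spelled out in the thread: route a small random fraction of docs to a "holdout" bucket that gets no suggestions, and treat only the labels gathered there as unbiased ground truth.

```python
# Sketch: deterministically assign a small fraction of docs to a
# no-suggestion holdout bucket. Labels chosen without seeing suggestions
# are free of the self-fulfilling bias described above.
import random

def assign_bucket(doc_id, epsilon=0.05, seed=42):
    # Seed with a string so the assignment is stable per doc across runs.
    rng = random.Random(f"{seed}:{doc_id}")
    return "holdout" if rng.random() < epsilon else "suggest"
```

Docs in the "holdout" bucket are labeled unaided; those labels feed the precision measurements, while "suggest" docs get the normal experience.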

On Oct 15, 2013, at 3:00 PM, Pat Cunnane <> wrote:

Hi, I've got a dataset of millions of short documents (think twitter) that
can be in one of about 30,000 categories. When a user is creating a new
document, I want to suggest a list of 5 possible categories for that
document to go into.

Right now I'm using the Naive Bayes classifier in Mahout and sorting the
results by score. My problem is that the recommender is sometimes not very
accurate, and I'd like to know:

Is there any way to find out a confidence level for a classification?
Ideally then I could set a threshold and not display recommendations if the
classifier is not confident.
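One rough way to get such a confidence, sketched generically rather than as a Mahout API: normalize the per-class scores (which are log-domain in Naive Bayes) with a softmax and suppress suggestions below a threshold. Note that Naive Bayes probabilities tend to be poorly calibrated, so treat the threshold as something to tune against a hold-out set.

```python
# Sketch (not Mahout's API): turn per-class Naive Bayes log-scores into
# a rough confidence via softmax, then threshold the suggestions.
import math

def confidence(log_scores):
    # Softmax over log-scores; subtract the max for numerical stability.
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def suggest(log_scores, k=5, min_conf=0.2):
    # Top-k categories whose softmax confidence clears the threshold.
    conf = confidence(log_scores)
    top = sorted(conf.items(), key=lambda kv: -kv[1])[:k]
    return [(c, p) for c, p in top if p >= min_conf]
```

If no category clears `min_conf`, `suggest` returns an empty list, i.e. show the user nothing rather than a low-confidence guess.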

Also, would it be better to consider another algorithm to achieve my goals?
I chose Naive Bayes because my dataset is pure text and very large. Any
thoughts would be greatly appreciated.


