mahout-user mailing list archives

From Andrew Butkus <>
Subject Re: Naive Bayes Classifier as a Recommender
Date Wed, 16 Oct 2013 08:00:17 GMT
Essentially, you provide the labels and the naive Bayes model as inputs and generate some scores.
'bestScore' will contain a value that indicates how well the input data matches your naive Bayes
model, and you can use this value as a range for how good the match
is ...
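
One way to turn raw scores into something you can threshold is sketched below. This is only an illustration, not code from the original message: the class `ScoreConfidence` and its methods are hypothetical names, and it assumes the per-category scores behave like unnormalized log-probabilities, which the message does not confirm. Naive Bayes scores are often poorly calibrated, so any threshold would need to be tuned on held-out data.

```java
final class ScoreConfidence {

    /** Softmax over raw scores, computed stably by subtracting the maximum first. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /**
     * Returns the index of the best-scoring category, or -1 when its
     * normalized share of the total is below the given threshold.
     * ASSUMPTION: scores are log-like, so exponentiating them is meaningful.
     */
    static int bestIfConfident(double[] scores, double threshold) {
        if (scores.length == 0) {
            return -1;
        }
        double[] p = softmax(scores);
        int best = 0;
        for (int i = 1; i < p.length; i++) {
            if (p[i] > p[best]) {
                best = i;
            }
        }
        return p[best] >= threshold ? best : -1;
    }
}
```

A caller would copy the entries of the classifier's result vector into a `double[]` indexed by category id and suppress the suggestion whenever `bestIfConfident` returns -1.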

Hope this helps. Apologies for the code; it was taken from a model class I created, so you
will need to modify things around to get something working, but the API calls and logic is


	public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath)
	{
		// maps each word in the dictionary sequence file to its integer id
		Map<String, Integer> dictionary = new HashMap<String, Integer>();

		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf))
		{
			dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
		}
		return dictionary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath)
	{
		// maps each word id to the number of documents that contain the word
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf))
		{
			documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}

	/**
	 * Gets the best label for the input data 
	 * @param  data  the data which will be classified
	 * @return the label of the highest-scoring category
	 */
	public String GetBestLabel(String data) throws IOException
	{
		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(m_model);

		// labels is a map classId => label
		Map<Integer, String> labels = BayesUtils.readLabelIndex(m_configuration, new Path(m_labelIndexPath));
		Map<String, Integer> dictionary = readDictionary(m_configuration, new Path(m_dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(m_configuration, new
		// (the rest of the statement above is cut off in the archived message)

		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
		// the total document count is stored under the key -1
		int documentCount = documentFrequency.get(-1).intValue();
		Multiset<String> words = ConcurrentHashMultiset.create();
		// extract words from text
		TokenStream ts = analyzer.tokenStream("text", new StringReader(data));
		CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
		// Lucene 4.x requires reset() before the first incrementToken()
		ts.reset();
		int wordCount = 0;
		while (ts.incrementToken()) 
		{
			if (termAtt.length() > 0) 
			{
				String word = ts.getAttribute(CharTermAttribute.class).toString();
				Integer wordId = dictionary.get(word);
				// skip words that are not in the dictionary
				if (wordId != null) 
				{
					// body missing in the archived message; reconstructed from
					// how 'words' and 'wordCount' are used below
					words.add(word);
					wordCount++;
				}
			}
		}
		ts.end();
		ts.close();

		// create vector wordId => weight using tfidf
		Vector vector = new RandomAccessSparseVector(10000);
		TFIDF tfidf = new TFIDF();
		for (Multiset.Entry<String> entry: words.entrySet()) 
		{
			String word = entry.getElement();
			int count = entry.getCount();
			Integer wordId = dictionary.get(word);
			Long freq = documentFrequency.get(wordId);
			double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
			vector.setQuick(wordId, tfIdfValue);
		}

		// one score per category; keep the highest
		Vector resultVector = classifier.classifyFull(vector);
		double bestScore = -Double.MAX_VALUE;
		int bestCategoryId = -1;
		for (Element element: resultVector) 
		{
			int categoryId = element.index();
			double score = element.get();
			if (score > bestScore) 
			{
				bestScore = score;
				bestCategoryId = categoryId;
			}
			//System.out.print("  " + labels.get(categoryId) + ": " + score);
		}

		m_score = bestScore;
		return labels.get(bestCategoryId);
	}
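
The method above keeps only the single best category, while the original question asks for a list of 5 suggestions. A minimal sketch of picking the N highest-scoring category ids follows. It is not part of the original message: `TopCategories` and `topN` are hypothetical names, and it assumes the scores have already been copied into a `double[]` indexed by category id.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopCategories {

    /** Returns up to n category ids, ordered from highest to lowest score. */
    static List<Integer> topN(final double[] scores, int n) {
        // Min-heap ordered by score: the weakest of the current best n sits on top.
        Comparator<Integer> byScore = Comparator.comparingDouble(i -> scores[i]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int i = 0; i < scores.length; i++) {
            heap.add(i);
            if (heap.size() > n) {
                heap.poll(); // drop the lowest-scoring candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return result;
    }
}
```

With about 30,000 categories the heap never holds more than n + 1 entries, so this avoids sorting the whole score array. Each returned id can then be mapped to its name with `labels.get(categoryId)`, as in the method above.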

On 16 Oct 2013, at 07:15, Pat Cunnane <> wrote:

> Thanks Andrew. I'd be interested to see what you're doing with the tfidf scores. If you
> could post some code that'd be awesome.
> On Wed, Oct 16, 2013 at 6:47 PM, Andrew Butkus <> wrote:
> I've been using the tfidf class to generate scores. I then use this
> score to determine how good the classification is. If you need more info,
> say so and I can get you some code.
> Sent from my Windows Phone
> From: Pat Cunnane
> Sent: 15/10/2013 23:00
> To:
> Subject: Naive Bayes Classifier as a Recommender
> Hi, I've got a dataset of millions of short documents (think twitter) that
> can be in one of about 30,000 categories. When a user is creating a new
> document, I want to suggest a list of 5 possible categories for that
> document to go into.
> Right now I'm using the Naive Bayes classifier in mahout and sorting the
> results by score. My problem is that sometimes the recommender is not very
> accurate and I'd like to know:
> Is there any way to find out a confidence level for a classification?
> Ideally then I could set a threshold and not display recommendations if the
> classifier is not confident.
> Also, would it be better to consider another algorithm to achieve my goals?
> I chose Naive Bayes because my dataset is pure text and very large. Any
> thoughts would be greatly appreciated.
> Thanks,
> Pat
