mahout-user mailing list archives

From Andrew Butkus <and...@butkus.co.uk>
Subject Re: Naive Bayes Classifier as a Recommender
Date Wed, 16 Oct 2013 08:00:17 GMT
essentially you provide the labels and the naive Bayes model as inputs, and generate some scores.
'bestScore' will hold the score of the best match against your naive Bayes model, and this value
is something you can use as a rough measure of how good the match is ...
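(By the way, the scores that come out of classifyFull are unnormalised log-likelihoods, so one way to get something confidence-like out of them is to softmax the score vector. Quick plain-Java sketch — the class and method names here are mine, this isn't part of the Mahout API:)

```java
// Sketch: turn the unnormalised log-likelihood scores from classifyFull
// into pseudo-probabilities with a softmax, so you can threshold on a
// confidence value. Plain Java, no Mahout classes needed.
public class ScoreConfidence
{
	public static double[] softmax(double[] logScores)
	{
		// subtract the max before exponentiating, for numerical stability
		double max = Double.NEGATIVE_INFINITY;
		for (double s : logScores)
		{
			max = Math.max(max, s);
		}

		double[] out = new double[logScores.length];
		double sum = 0.0;

		for (int i = 0; i < logScores.length; i++)
		{
			out[i] = Math.exp(logScores[i] - max);
			sum += out[i];
		}

		for (int i = 0; i < out.length; i++)
		{
			out[i] /= sum;
		}

		return out;
	}

	public static void main(String[] args)
	{
		double[] conf = softmax(new double[] { -120.0, -125.0, -130.0 });
		System.out.println(java.util.Arrays.toString(conf));
	}
}
```

You could then only show a recommendation when the top pseudo-probability clears some cutoff, say 0.9.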

Hope this helps. Apologies for the state of the code — it was taken from a model class I created, so you
will need to rework things to get it running, but the API calls and logic are there.

Andy

	// Reads the term => termId dictionary written out by seq2sparse
	public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath)
	{
		Map<String, Integer> dictionary = new HashMap<String, Integer>();

		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf))
		{
			dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
		}

		return dictionary;
	}

	// Reads the termId => document frequency map; key -1 holds the total document count
	public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath)
	{
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();

		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf))
		{
			documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
		}

		return documentFrequency;
	}

	/**
	 * Gets the best label for the input data.
	 *
	 * @param data the text which will be classified
	 * @return the label with the highest score
	 */
	public String getBestLabel(String data) throws IOException
	{
		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(m_model);

		// labels is a map classId => label
		Map<Integer, String> labels = BayesUtils.readLabelIndex(m_configuration, new Path(m_labelIndexPath));
		Map<String, Integer> dictionary = readDictionary(m_configuration, new Path(m_dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(m_configuration, new Path(m_documentFrequencyPath));

		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
		int documentCount = documentFrequency.get(-1).intValue(); // key -1 stores the total doc count
		Multiset<String> words = ConcurrentHashMultiset.create();
		
		// extract words from text
		TokenStream ts = analyzer.tokenStream("text", new StringReader(data));
		
		CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
		ts.reset();
		
		int wordCount = 0;
		
		while (ts.incrementToken())
		{
			if (termAtt.length() > 0)
			{
				String word = termAtt.toString();
				Integer wordId = dictionary.get(word);

				// only count words that appear in the dictionary
				if (wordId != null)
				{
					words.add(word);
					wordCount++;
				}
			}
		}

		ts.end();
		ts.close();

		// create vector wordId => weight using tfidf
		// (the cardinality should match the dimension used when the model was trained)
		Vector vector = new RandomAccessSparseVector(10000);
		TFIDF tfidf = new TFIDF();
		
		for (Multiset.Entry<String> entry: words.entrySet()) 
		{
			String word = entry.getElement();
			
			int count = entry.getCount();
			
			Integer wordId = dictionary.get(word);
			Long freq = documentFrequency.get(wordId);
			
			double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
			
			vector.setQuick(wordId, tfIdfValue);
		}
		
		Vector resultVector = classifier.classifyFull(vector);

		// classifyFull returns unnormalised log-likelihood scores;
		// pick the category with the highest one
		double bestScore = -Double.MAX_VALUE;
		int bestCategoryId = -1;

		for (Element element : resultVector)
		{
			int categoryId = element.index();
			double score = element.get();

			if (score > bestScore)
			{
				bestScore = score;
				bestCategoryId = categoryId;
			}
			//System.out.print("  " + labels.get(categoryId) + ": " + score);
		}
		
		//System.out.println();

		analyzer.close();
		
		m_score = bestScore;
		
		return labels.get(bestCategoryId);
	}
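For reference, the TFIDF.calculate(...) call used above delegates (if I remember right) to Lucene's DefaultSimilarity weighting, roughly sqrt(tf) * (log(numDocs / (df + 1)) + 1). A standalone sketch of that formula — my own helper for illustration, not the Mahout class:

```java
// Sketch of the tf-idf weighting (assumed DefaultSimilarity-style:
// sqrt(tf) * (log(numDocs / (df + 1)) + 1)); for illustration only —
// the real computation is done by Mahout's TFIDF class.
public class TfIdfSketch
{
	public static double tfIdf(int termFreq, int docFreq, int numDocs)
	{
		double tf = Math.sqrt(termFreq);
		double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
		return tf * idf;
	}

	public static void main(String[] args)
	{
		// a term occurring 4 times in the document, present in 9 of 1000 docs
		System.out.println(tfIdf(4, 9, 1000));
	}
}
```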



On 16 Oct 2013, at 07:15, Pat Cunnane <pcunnane@gmail.com> wrote:

> Thanks Andrew. I'd be interested to see what you're doing with the tfidf scores. If you could post some code that'd be awesome.
> 
> On Wed, Oct 16, 2013 at 6:47 PM, Andrew Butkus <andrew@butkus.co.uk> wrote:
> I've been using the tfidf class to generate scores. I then use this
> score to determine how good the classification is. If you need more
> info, say so, and I can get you some code.
> 
> Sent from my Windows Phone
> 
> From: Pat Cunnane
> Sent: 15/10/2013 23:00
> To: user@mahout.apache.org
> Subject: Naive Bayes Classifier as a Recommender
> Hi, I've got a dataset of millions of short documents (think twitter) that
> can be in one of about 30,000 categories. When a user is creating a new
> document, I want to suggest a list of 5 possible categories for that
> document to go into.
> 
> Right now I'm using the Naive Bayes classifier in mahout and sorting the
> results by score. My problem is that sometimes the recommender is not very
> accurate and I'd like to know:
> 
> Is there any way to find out a confidence level for a classification?
> Ideally then I could set a threshold and not display recommendations if the
> classifier is not confident.
> 
> Also, would it be better to consider another algorithm to achieve my goals?
> I chose Naive Bayes because my dataset is pure text and very large. Any
> thoughts would be greatly appreciated.
> 
> Thanks,
> 
> Pat
> 

