mahout-user mailing list archives

From Andrew Butkus <and...@butkus.co.uk>
Subject RE: Naive Bayes Classifier as a Recommender
Date Wed, 16 Oct 2013 10:21:02 GMT
If you have the top 5 categories, you could compare the boost score
between each of them and work out a percentage for how far the top
recommendation sits above the lowest one; that gives you a confidence
measure. The bigger the gap, the better.

Sent from my Windows Phone
 ------------------------------
From: Andrew Butkus <andrew@butkus.co.uk>
Sent: 16/10/2013 11:08
To: Andrew Butkus <andrew@butkus.co.uk>; Pat Cunnane <pcunnane@gmail.com>
Cc: user@mahout.apache.org
Subject: RE: Naive Bayes Classifier as a Recommender


Here is the Stack Overflow question:

http://stackoverflow.com/questions/19097673/apache-mahout-naive-bayes-training-size

The upshot was that it doesn't really matter what you put into NB, as
long as there's enough data in each category to balance the output score.

Sent from my Windows Phone
 ------------------------------
From: Andrew Butkus <andrew@butkus.co.uk>
Sent: 16/10/2013 11:04
To: Pat Cunnane <pcunnane@gmail.com>
Cc: user@mahout.apache.org
Subject: RE: Naive Bayes Classifier as a Recommender

I thought this too, to start with, but do some reading on tf-idf.

There is a Stack Overflow question I asked about NB, and the info I got
back was as follows:

Basically (and correct me if I'm wrong), by using tf-idf between classes, the
output score generated is automatically balanced, so a comparison can be
made between two classes as to which is the better fit.
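For intuition, here is a plain-Java sketch of the classic tf-idf weighting (Mahout's TFIDF class delegates to Lucene's similarity, so the exact formula it uses may differ; this is the common sqrt-tf, log-idf variant):

```java
public class TfIdfSketch {
    // tf  = sqrt(term frequency in the document)
    // idf = 1 + log(numDocs / (docFreq + 1)) -- terms that are rare across
    //       the corpus get a higher weight than ubiquitous ones
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return Math.sqrt(termFreq) * (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    public static void main(String[] args) {
        // the same in-document frequency scores higher for a corpus-rare term
        System.out.println(tfIdf(2, 3, 1000));   // rare term: high weight
        System.out.println(tfIdf(2, 900, 1000)); // common term: low weight
    }
}
```

The idf term is what does the balancing between classes: very common words contribute almost nothing to any class's score.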

You can then use, say, the top 5 best scores to recommend to your user which
category / label to file the document under.
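Picking the top 5 might look like this (plain Java; the label-to-score map here is illustrative — in practice the labels and scores come out of classifyFull plus the label index):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopCategories {
    // Return the k labels with the highest scores, best first.
    static List<String> topK(Map<String, Double> scores, int k) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("sport", -120.0);
        scores.put("politics", -150.0);
        scores.put("tech", -110.0);
        scores.put("music", -300.0);
        System.out.println(topK(scores, 2)); // prints [tech, sport]
    }
}
```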

Sent from my Windows Phone
 ------------------------------
From: Pat Cunnane <pcunnane@gmail.com>
Sent: 16/10/2013 09:55
To: Andrew Butkus <andrew@butkus.co.uk>
Cc: user@mahout.apache.org
Subject: Re: Naive Bayes Classifier as a Recommender

Hey, thanks for posting that, Andy.

In your example, bestScore is the highest score from the classifier's
results. The problem I see is that that score doesn't stay within a defined
range, so I don't think you can set a fixed threshold for bestScore: the
range of scores changes per classification.
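One standard trick for that (not something Mahout does for you here, just a common normalisation) is to push the log-scale scores through a log-sum-exp / softmax, which always lands each score in [0, 1]:

```java
public class ScoreNormalizer {
    // Convert log-scale scores into values that sum to 1, subtracting the
    // max first for numerical stability (exp of large negatives underflows).
    static double[] toProbabilities(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) max = Math.max(max, s);
        double sum = 0.0;
        double[] probs = new double[logScores.length];
        for (int i = 0; i < logScores.length; i++) {
            probs[i] = Math.exp(logScores[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    public static void main(String[] args) {
        double[] probs = toProbabilities(new double[]{-120.0, -121.0, -125.0});
        // the best category's share is now a bounded confidence value
        System.out.println(probs[0]);
    }
}
```

A caveat: naive Bayes scores tend to be badly over-confident once normalised this way, so treat the result as a bounded ranking signal rather than a calibrated probability.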


On Wed, Oct 16, 2013 at 9:00 PM, Andrew Butkus <andrew@butkus.co.uk> wrote:

> Essentially you provide the labels and the naive Bayes model as inputs, and
> generate some scores. 'bestScore' will contain a value which indicates how
> closely the input data matches your naive Bayes model, and this value is
> something you can use as a measure of how good the match is.
>
> Hope this helps. Apologies for the code; it was taken from a model class I
> created, so you will need to adapt it to get something working, but the API
> calls and logic are there.
>
> Andy
>
> public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath)
> {
>     Map<String, Integer> dictionary = new HashMap<String, Integer>();
>     for (Pair<Text, IntWritable> pair :
>             new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf))
>     {
>         dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
>     }
>     return dictionary;
> }
>
> public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath)
> {
>     Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
>     for (Pair<IntWritable, LongWritable> pair :
>             new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf))
>     {
>         documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
>     }
>     return documentFrequency;
> }
>
> /**
>  * Gets the best label for the input data.
>  *
>  * @param data the text which will be classified
>  */
> public String GetBestLabel(String data) throws IOException
> {
>     StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(m_model);
>
>     // labels is a map classId => label
>     Map<Integer, String> labels = BayesUtils.readLabelIndex(m_configuration, new Path(m_labelIndexPath));
>     Map<String, Integer> dictionary = readDictionary(m_configuration, new Path(m_dictionaryPath));
>     Map<Integer, Long> documentFrequency = readDocumentFrequency(m_configuration, new Path(m_documentFrequencyPath));
>
>     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
>     // the total document count is stored under the special key -1
>     int documentCount = documentFrequency.get(-1).intValue();
>
>     // extract words from the text, keeping only terms present in the dictionary
>     Multiset<String> words = ConcurrentHashMultiset.create();
>     TokenStream ts = analyzer.tokenStream("text", new StringReader(data));
>     CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>     ts.reset();
>     int wordCount = 0;
>     while (ts.incrementToken())
>     {
>         if (termAtt.length() > 0)
>         {
>             String word = termAtt.toString();
>             Integer wordId = dictionary.get(word);
>             if (wordId != null)
>             {
>                 words.add(word);
>                 wordCount++;
>             }
>         }
>     }
>     ts.close();
>
>     // create a vector wordId => weight using tf-idf
>     Vector vector = new RandomAccessSparseVector(10000);
>     TFIDF tfidf = new TFIDF();
>     for (Multiset.Entry<String> entry : words.entrySet())
>     {
>         String word = entry.getElement();
>         int count = entry.getCount();
>         Integer wordId = dictionary.get(word);
>         Long freq = documentFrequency.get(wordId);
>         double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
>         vector.setQuick(wordId, tfIdfValue);
>     }
>
>     // classify and keep the highest-scoring category
>     Vector resultVector = classifier.classifyFull(vector);
>     double bestScore = -Double.MAX_VALUE;
>     int bestCategoryId = -1;
>     for (Element element : resultVector)
>     {
>         int categoryId = element.index();
>         double score = element.get();
>         if (score > bestScore)
>         {
>             bestScore = score;
>             bestCategoryId = categoryId;
>         }
>     }
>
>     analyzer.close();
>     m_score = bestScore;
>     return labels.get(bestCategoryId);
> }
>
>
>
> On 16 Oct 2013, at 07:15, Pat Cunnane <pcunnane@gmail.com> wrote:
>
> Thanks Andrew. I'd be interested to see what you're doing with the tfidf
> scores. If you could post some code that'd be awesome.
>
>
>
>
> On Wed, Oct 16, 2013 at 6:47 PM, Andrew Butkus <andrew@butkus.co.uk> wrote:
>
>> I've been using the TFIDF class to generate scores, and I then use this
>> score to determine how good the classification is. If you need more info,
>> say so, and I can get you some code.
>>
>> Sent from my Windows Phone
>>  ------------------------------
>> From: Pat Cunnane
>> Sent: 15/10/2013 23:00
>> To: user@mahout.apache.org
>> Subject: Naive Bayes Classifier as a Recommender
>> Hi, I've got a dataset of millions of short documents (think twitter) that
>> can be in one of about 30,000 categories. When a user is creating a new
>> document, I want to suggest a list of 5 possible categories for that
>> document to go into.
>>
>> Right now I'm using the Naive Bayes classifier in Mahout and sorting the
>> results by score. My problem is that sometimes the recommender is not very
>> accurate and I'd like to know:
>>
>> Is there any way to find out a confidence level for a classification?
>> Ideally then I could set a threshold and not display recommendations if
>> the classifier is not confident.
>>
>> Also, would it be better to consider another algorithm to achieve my
>> goals?
>> I chose Naive Bayes because my dataset is pure text and very large. Any
>> thoughts would be greatly appreciated.
>>
>> Thanks,
>>
>> Pat
>>
>
>
>
