mahout-user mailing list archives

From Andrew Butkus <and...@butkus.co.uk>
Subject RE: Naive Bayes Classifier as a Recommender
Date Wed, 16 Oct 2013 10:04:24 GMT
I thought this too, to start with, but do some reading on TF-IDF.

There is a Stack Overflow question I asked about naive Bayes, and the info I
got back was as follows:

Basically (and correct me if I'm wrong), because the input is weighted with
TF-IDF, the output scores generated for the different classes are
automatically balanced against each other. So the scores of two classes can be
compared to see which class is the better fit.

You can then use, say, the top 5 scores to recommend to your user which
category / label to put the document into.
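
A minimal sketch of that top-5 selection, in plain Java with no Mahout dependency. It assumes the classifier scores have already been copied into a double[] indexed by class id, and that labels maps class id to label. The class name TopLabels and the method topN are made up for illustration and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopLabels {
    /**
     * Returns up to n labels, ordered from highest to lowest score.
     * scores[i] is assumed to be the classifier score for class id i.
     */
    public static List<String> topN(final double[] scores,
                                    Map<Integer, String> labels, int n) {
        // Collect all class ids, then sort them by descending score.
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());

        // Translate the best n class ids into their labels.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ids.size()); i++) {
            result.add(labels.get(ids.get(i)));
        }
        return result;
    }
}
```

With around 30,000 categories a full sort per document is still cheap. A bounded priority queue would avoid sorting everything if that ever becomes a bottleneck.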

Sent from my Windows Phone
 ------------------------------
From: Pat Cunnane <pcunnane@gmail.com>
Sent: 16/10/2013 09:55
To: Andrew Butkus <andrew@butkus.co.uk>
Cc: user@mahout.apache.org
Subject: Re: Naive Bayes Classifier as a Recommender

Hey thanks for posting that Andy.

In your example, bestScore is the highest score from the classifier's
results. The problem I see is that that score doesn't stay within a defined
range.
So I don't think you can set a fixed threshold for bestScore since the
range of scores changes per classification.
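
One possible way around the moving range, sketched here as an assumption rather than anything proposed in this thread: if the scores behave like unnormalized log-likelihoods, they can be rescaled per document so that they sum to 1, and a threshold can then be applied to the largest share. The class ScoreNormalizer and its methods are hypothetical names, and the code is plain Java with no Mahout calls.

```java
public class ScoreNormalizer {
    /**
     * Converts log-scale scores into shares that sum to 1.
     * Subtracting the maximum first (the log-sum-exp trick) keeps
     * Math.exp from underflowing on very negative scores.
     */
    public static double[] softmax(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double[] out = new double[logScores.length];
        double sum = 0.0;
        for (int i = 0; i < logScores.length; i++) {
            out[i] = Math.exp(logScores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /** True if the best class takes at least the given share of the total. */
    public static boolean isConfident(double[] logScores, double threshold) {
        double best = 0.0;
        for (double p : softmax(logScores)) {
            best = Math.max(best, p);
        }
        return best >= threshold;
    }
}
```

Naive Bayes tends to produce overconfident estimates, so the resulting shares are better treated as a relative ranking signal than as true probabilities, and any threshold would need tuning against held-out data.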


On Wed, Oct 16, 2013 at 9:00 PM, Andrew Butkus <andrew@butkus.co.uk> wrote:

> Essentially you provide the labels and the naive Bayes model as inputs, and
> generate some scores. 'bestScore' will contain a value which indicates how
> well the input data matches your naive Bayes model, and this value is
> something you can use to gauge how good the match is ...
>
> Hope this helps. Apologies for the code: it was taken from a model class I
> created, so you will need to move things around to get something working,
> but the API calls and logic are there.
>
> Andy
>
> public static Map<String, Integer> readDictionary(Configuration conf,
>     Path dictionaryPath)
> {
>   Map<String, Integer> dictionary = new HashMap<String, Integer>();
>
>   // each record in the sequence file maps a term to its integer id
>   for (Pair<Text, IntWritable> pair :
>       new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf))
>   {
>     dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
>   }
>   return dictionary;
> }
>
> public static Map<Integer, Long> readDocumentFrequency(Configuration conf,
>     Path documentFrequencyPath)
> {
>   Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
>   for (Pair<IntWritable, LongWritable> pair :
>       new SequenceFileIterable<IntWritable, LongWritable>(
>           documentFrequencyPath, true, conf))
>   {
>     documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
>   }
>   return documentFrequency;
> }
>
> /**
>  * Gets the best label for the input data
>  *
>  * @param  data  the data which will be classified
>  * @return the label with the highest classifier score
>  */
> public String GetBestLabel(String data) throws IOException
> {
>   StandardNaiveBayesClassifier classifier =
>       new StandardNaiveBayesClassifier(m_model);
>
>   // labels is a map classId => label
>   Map<Integer, String> labels = BayesUtils.readLabelIndex(m_configuration,
>       new Path(m_labelIndexPath));
>   Map<String, Integer> dictionary = readDictionary(m_configuration,
>       new Path(m_dictionaryPath));
>   Map<Integer, Long> documentFrequency =
>       readDocumentFrequency(m_configuration,
>           new Path(m_documentFrequencyPath));
>
>   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
>   // the entry stored under key -1 is read as the total document count
>   int documentCount = documentFrequency.get(-1).intValue();
>   Multiset<String> words = ConcurrentHashMultiset.create();
>
>   // extract words from text, keeping only those found in the dictionary
>   TokenStream ts = analyzer.tokenStream("text", new StringReader(data));
>   CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>   ts.reset();
>   int wordCount = 0;
>   while (ts.incrementToken())
>   {
>     if (termAtt.length() > 0)
>     {
>       String word = termAtt.toString();
>       Integer wordId = dictionary.get(word);
>       if (wordId != null)
>       {
>         words.add(word);
>         wordCount++;
>       }
>     }
>   }
>   // finish and release the token stream, as the TokenStream contract requires
>   ts.end();
>   ts.close();
>
>   // create vector wordId => weight using tfidf
>   Vector vector = new RandomAccessSparseVector(10000);
>   TFIDF tfidf = new TFIDF();
>   for (Multiset.Entry<String> entry : words.entrySet())
>   {
>     String word = entry.getElement();
>     int count = entry.getCount();
>     Integer wordId = dictionary.get(word);
>     Long freq = documentFrequency.get(wordId);
>     double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount,
>         documentCount);
>     vector.setQuick(wordId, tfIdfValue);
>   }
>
>   // one score per category; keep the category with the highest score
>   Vector resultVector = classifier.classifyFull(vector);
>   double bestScore = -Double.MAX_VALUE;
>   int bestCategoryId = -1;
>   for (Element element : resultVector)
>   {
>     int categoryId = element.index();
>     double score = element.get();
>     if (score > bestScore)
>     {
>       bestScore = score;
>       bestCategoryId = categoryId;
>     }
>     //System.out.print("  " + labels.get(categoryId) + ": " + score);
>   }
>   //System.out.println();
>
>   analyzer.close();
>   m_score = bestScore;
>   return labels.get(bestCategoryId);
> }
>
>
>
> On 16 Oct 2013, at 07:15, Pat Cunnane <pcunnane@gmail.com> wrote:
>
> Thanks Andrew. I'd be interested to see what you're doing with the tfidf
> scores. If you could post some code that'd be awesome.
>
>
>
>
> On Wed, Oct 16, 2013 at 6:47 PM, Andrew Butkus <andrew@butkus.co.uk> wrote:
>
>> I've been using the TFIDF class to generate scores. I then use this
>> score to determine how good the classification is. If you need more info,
>> say so, and I can get you some code.
>>
>> Sent from my Windows Phone
>> From: Pat Cunnane
>> Sent: 15/10/2013 23:00
>> To: user@mahout.apache.org
>> Subject: Naive Bayes Classifier as a Recommender
>> Hi, I've got a dataset of millions of short documents (think twitter) that
>> can be in one of about 30,000 categories. When a user is creating a new
>> document, I want to suggest a list of 5 possible categories for that
>> document to go into.
>>
>> Right now I'm using the Naive Bayes classifier in mahout and sorting the
>> results by score. My problem is that sometimes the recommender is not very
>> accurate and I'd like to know:
>>
>> Is there any way to find out a confidence level for a classification?
>> Ideally then I could set a threshold and not display recommendations if
>> the classifier is not confident.
>>
>> Also, would it be better to consider another algorithm to achieve my
>> goals? I chose Naive Bayes because my dataset is pure text and very large.
>> Any thoughts would be greatly appreciated.
>>
>> Thanks,
>>
>> Pat
>>
>
>
>
