mahout-user mailing list archives

From Jonathan Cooper-Ellis <...@ziftr.com>
Subject Re: How to get document count for TFIDF calculate method?
Date Tue, 29 Jul 2014 18:48:46 GMT
Hello again,

Looks like I figured out what the problem was. I was supposed to be using
df-count, and not frequency.file-0. df-count does have a key of -1 with a
value that looks like the total number of documents.
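
For anyone who hits the same NPE, here is a minimal sketch of the lookup in plain Java. It assumes the contents of df-count have already been loaded into a Map<Integer, Long>, and that key -1 holds the total number of documents, as it appeared to in my output. The class and method names (DfCountSketch, documentCount, frequencyOf) are made up for illustration and are not part of Mahout.

```java
import java.util.HashMap;
import java.util.Map;

class DfCountSketch {
    // Assumption from this thread: df-count stores the corpus size under key -1.
    static final int DOC_COUNT_KEY = -1;

    // Returns the total document count, failing with a clear message instead of
    // an NPE when the map was loaded from a file that has no -1 entry.
    static int documentCount(Map<Integer, Long> documentFrequency) {
        Long total = documentFrequency.get(DOC_COUNT_KEY);
        if (total == null) {
            throw new IllegalStateException(
                "No entry for key -1; was the map loaded from df-count?");
        }
        return Math.toIntExact(total);
    }

    // Returns the document frequency for a word id, or 0 if the id is unknown.
    static long frequencyOf(Map<Integer, Long> documentFrequency, int wordId) {
        Long freq = documentFrequency.get(wordId);
        return freq == null ? 0L : freq;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for the contents of df-count.
        Map<Integer, Long> df = new HashMap<>();
        df.put(-1, 200L); // total number of documents
        df.put(0, 12L);   // word id 0 appears in 12 documents
        df.put(1, 3L);    // word id 1 appears in 3 documents

        System.out.println(documentCount(df));   // prints 200
        System.out.println(frequencyOf(df, 1));  // prints 3
        System.out.println(frequencyOf(df, 42)); // prints 0
    }
}
```

The explicit null check is there only so that loading the wrong file produces a readable error rather than an NPE on intValue().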

Thanks again for the responses.


On Tue, Jul 29, 2014 at 2:22 PM, Jonathan Cooper-Ellis <jce@ziftr.com>
wrote:

> Hi Suneel,
>
> Thanks for the response. Yes, I'm trying to determine it from the output
> of seq2sparse. Here's the relevant excerpt from my code:
>
> // Create a vector of wordId=>weight using tfidf.
> Vector vector = new RandomAccessSparseVector(10000);
> TFIDF tfidf = new TFIDF();
>
> int documentCount = documentFrequency.get(-1).intValue(); // THIS IS THROWING NPE
>
> for (Multiset.Entry<String> entry : words.entrySet()) {
>     String word = entry.getElement();
>     int count = entry.getCount();
>     Integer wordId = dictionary.get(word);
>     Long freq = documentFrequency.get(wordId);
>     double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
>     vector.setQuick(wordId, tfIdfValue);
> }
>
>
> I'm working off this tutorial:
> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
>
>
> On Tue, Jul 29, 2014 at 1:50 PM, Suneel Marthi <suneel.marthi@gmail.com>
> wrote:
>
>> Have been silently following this discussion for some time now. Jonathan, if
>> I understand you right, you are trying to determine the number of docs in
>> your corpus.
>> Correct?
>>
>> One of the artifacts from seq2sparse should have the doc count, not sure
>> which one off the top of my head and I am not in front of a computer.
>>
>> The other quick way to determine the number of docs would be to take the
>> tf-idf vectors generated and feed them as input to the RowId job.
>> The outputs of the RowId job are matrix and docIndex.
>>
>> docIndex - mapping of document names to integer ids
>> matrix - M x N matrix of M documents and N features
>>
>> docIndex should tell you the number of documents in your corpus.
>>
>> This is a quick and dirty way of doing it. I am sure there's a way to infer
>> that from the output of seq2sparse itself (but I am not in front of my
>> computer now).
>>
>>
>>
>> On Tue, Jul 29, 2014 at 10:40 AM, Jonathan Cooper-Ellis <jce@ziftr.com>
>> wrote:
>>
>> > Hi Vaibhav,
>> >
>> > Thanks for the reply. It doesn't look like total count of keys in
>> > frequency.file-0 corresponds to the number of documents, because I only
>> > used a couple hundred documents to build the model and there are thousands
>> > of keys in frequency.file-0. Am I misunderstanding something?
>> >
>> >
>> > On Tue, Jul 29, 2014 at 1:15 PM, vaibhav srivastava <vaibhavcse30@gmail.com>
>> > wrote:
>> >
>> > > Hi, if I am correct, you want to know the number of documents by reading
>> > > frequency.file-0. You can use SequenceFile.Reader to load the frequency
>> > > file and then count the number of keys; that will give you the number of
>> > > documents.
>> > > Hope this helps,
>> > > Thanks,
>> > > vaibhav
>> > >
>> > >
>> > > On Tue, Jul 29, 2014 at 10:32 PM, Jonathan Cooper-Ellis <jce@ziftr.com>
>> > > wrote:
>> > >
>> > > > Hey guys,
>> > > >
>> > > > I'm trying to make a Bayesian classifier, but I'm having a hard time
>> > > > figuring out how to programmatically determine the value of the numDocs
>> > > > param for the calculate method in TFIDF, using the files generated when
>> > > > building the model on the command line.
>> > > >
>> > > > I saw some code that did it like this:
>> > > >
>> > > > int numDocs = documentFrequency.get(-1).intValue();
>> > > >
>> > > > Where documentFrequency is a HashMap<Integer,Long> read from
>> > > > frequency.file-0, but there's no key -1 in the file, so it's giving me an
>> > > > NPE when I try to pass that to tfidf.calculate.
>> > > >
>> > > > Anyone know what I'm doing wrong?
>> > > >
>> > > >
>> > > > Best,
>> > > >
>> > > > jce
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>
>
