# mahout-user mailing list archives

From Jonathan Cooper-Ellis <...@ziftr.com>
Subject Re: How to get document count for TFIDF calculate method?
Date Tue, 29 Jul 2014 18:22:43 GMT
```
Hi Suneel,

Thanks for the response. Yes, I'm trying to determine it from the output of
seq2sparse. Here's the relevant excerpt from my code:

// Create a vector of wordId=>weight using tfidf.
Vector vector = new RandomAccessSparseVector(10000);
TFIDF tfidf = new TFIDF();
int documentCount = documentFrequency.get(-1).intValue(); // THIS IS THROWING NPE

for (Multiset.Entry<String> entry : words.entrySet()) {
    String word = entry.getElement();
    int count = entry.getCount();
    Integer wordId = dictionary.get(word);
    Long freq = documentFrequency.get(wordId);
    double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
    vector.setQuick(wordId, tfIdfValue);
}

I'm working off this tutorial:

On Tue, Jul 29, 2014 at 1:50 PM, Suneel Marthi <suneel.marthi@gmail.com>
wrote:

> Have been silently following this discussion for some time now. Jonathan, if
> I understand u right, u r trying to determine the no. of docs in ur corpus.
> Correct?
>
> One of the artifacts from seq2sparse should have the doc count, not sure
> which one off the top of my head and I am not in front of a computer.
>
> The other quick way to determine the no. of docs would be to take the
> tf-idf vectors generated and feed them as input to RowId job.
> The outputs of the RowId job are matrix and docIndex.
>
> docIndex - mapping of document names to integerIds
> matrix - M * N matrix of M documents and N feature vectors
>
> docIndex should tell u the no. of documents in ur corpus.
>
> This is a quick and dirty way of doing it, I am sure there's a way to infer
> that from the o/p of seq2sparse itself (but I am not in front of my computer
> now).
>
>
>
> On Tue, Jul 29, 2014 at 10:40 AM, Jonathan Cooper-Ellis <jce@ziftr.com>
> wrote:
>
> > Hi Vaibhav,
> >
> > Thanks for the reply. It doesn't look like the total count of keys in
> > frequency.file-0 corresponds to the number of documents, because I only
> > used a couple hundred documents to build the model and there are thousands
> > of keys in frequency.file-0. Am I misunderstanding something?
> >
> >
> > On Tue, Jul 29, 2014 at 1:15 PM, vaibhav srivastava
> > <vaibhavcse30@gmail.com> wrote:
> >
> > > Hi, if I am correct, you want to know the number of documents by reading
> > > frequency.file-0. You can use the SequenceFileReader to load the
> > > frequency file and then count the number of keys; that will give you the
> > > number of documents.
> > > Hope this helps,
> > > Thanks,
> > > vaibhav
> > >
> > >
> > > On Tue, Jul 29, 2014 at 10:32 PM, Jonathan Cooper-Ellis <jce@ziftr.com>
> > > wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I'm trying to make a Bayesian classifier, but I'm having a hard time
> > > > figuring out how to programmatically determine the value of the numDocs
> > > > param for the calculate method in TFIDF, using the files generated when
> > > > building the model on the command line.
> > > >
> > > > I saw some code that did it like this:
> > > >
> > > > int numDocs = documentFrequency.get(-1).intValue();
> > > >
> > > > Where documentFrequency is a HashMap<Integer,Long> read from
> > > > frequency.file-0, but there's no key -1 in the file, so it's giving me an
> > > > NPE when I try to pass that to tfidf.calculate.
> > > >
> > > > Anyone know what I'm doing wrong?
> > > >
> > > >
> > > > Best,
> > > >
> > > > jce
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vaibhav Srivastava
> > > Email-id: vaibhavcse30@gmail.com
> > > Mobile no.: 9552543029
> > >
> >
>

```
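
The NPE described in the message above comes from calling `.intValue()` on the `null` that `HashMap.get` returns when the key is absent. Below is a minimal sketch of a guarded lookup, using only the JDK. The class name, the method names, and the fallback value are hypothetical. The idea that key -1 holds the document count is taken from the code quoted in the thread, not confirmed against the files seq2sparse writes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

public class DocumentCountSketch {

    // Assumption from the thread: the corpus size may be stored under key -1.
    public static final int TOTAL_COUNT_KEY = -1;

    // Returns the document count if the sentinel key is present, else empty.
    public static OptionalInt documentCount(Map<Integer, Long> documentFrequency) {
        Long total = documentFrequency.get(TOTAL_COUNT_KEY);
        if (total == null) {
            return OptionalInt.empty(); // avoids the NPE from null.intValue()
        }
        return OptionalInt.of(Math.toIntExact(total));
    }

    // Uses a caller-supplied count (for example, one obtained some other way,
    // such as counting entries in a docIndex) when the sentinel key is missing.
    public static int documentCountOrFallback(Map<Integer, Long> documentFrequency,
                                              int fallback) {
        return documentCount(documentFrequency).orElse(fallback);
    }

    public static void main(String[] args) {
        Map<Integer, Long> documentFrequency = new HashMap<>();
        documentFrequency.put(0, 12L);
        documentFrequency.put(1, 3L);

        // No key -1 yet, so the fallback is used.
        System.out.println(documentCountOrFallback(documentFrequency, 200));

        // With the sentinel key present, its value wins over the fallback.
        documentFrequency.put(TOTAL_COUNT_KEY, 250L);
        System.out.println(documentCountOrFallback(documentFrequency, 200));
    }
}
```

Returning an `OptionalInt` makes the missing-key case explicit at the call site, so the caller has to decide where the count comes from when the frequency file does not carry it.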