mahout-user mailing list archives

From Daniel McEnnis <dmcen...@gmail.com>
Subject Re: Create vector using existing dictionary and IDF values
Date Sun, 17 Apr 2011 16:09:17 GMT
Julian,

Your dictionary contains only the terms seen in the training set.
When you run against a different set of documents, it may contain
terms that are present in the new set but not in the old one.
Unless you handle this case explicitly, those terms will cause
IndexOutOfBounds or NullPointer errors, depending on how you
implement the dictionary lookup.
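
A minimal sketch of one way to handle that case, written in plain Java
rather than against Mahout's own classes. Everything here is an
assumption for illustration: the class name QueryVectorizer, the
method vectorize, the in-memory maps standing in for the dictionary
and df-count files, and the plain tf * ln(N / df) weighting (which is
not necessarily the weighting seq2sparse applies). The only point is
that terms missing from the old dictionary are skipped instead of
being looked up blindly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
final class QueryVectorizer {

    /**
     * Builds a sparse TF-IDF vector (term index -> weight) for a new
     * document using a dictionary and document-frequency counts that
     * were produced by an earlier run over the full corpus.
     */
    public static Map<Integer, Double> vectorize(List<String> tokens,
                                                 Map<String, Integer> dictionary,
                                                 Map<Integer, Long> docFreq,
                                                 long numDocs) {
        // Count term frequencies, keeping only terms the dictionary knows.
        Map<Integer, Integer> termFreq = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // term not seen in the original corpus: skip it
            }
            termFreq.merge(index, 1, Integer::sum);
        }

        // Weight each known term by tf * ln(numDocs / df) (assumed formula).
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : termFreq.entrySet()) {
            // Fall back to df = 1 if a count is missing for a known index.
            long df = docFreq.getOrDefault(entry.getKey(), 1L);
            double idf = Math.log((double) numDocs / df);
            vector.put(entry.getKey(), entry.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("apache", 0);
        dictionary.put("mahout", 1);
        dictionary.put("vector", 2);

        Map<Integer, Long> docFreq = new HashMap<>();
        docFreq.put(0, 10L);
        docFreq.put(1, 5L);
        docFreq.put(2, 2L);

        // "hadoop" is not in the dictionary and is silently dropped.
        List<String> query = List.of("mahout", "vector", "vector", "hadoop");
        System.out.println(vectorize(query, dictionary, docFreq, 10L));
    }
}
```

Any vector you build from the result should be sized from the old
dictionary (here, 3 entries), not from the new documents, so that
every index stays in range.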

Daniel

On Sun, Apr 17, 2011 at 3:09 AM, Julian Limon <julian.limon@tukipa.com> wrote:
> Hello all,
>
> Sorry to bother you again, but I've been banging my head against the wall
> for the last day and I can't seem to find the answer.
>
> I'm trying to create a new tfidf vector (or probably many vectors) out of a
> new directory using something like seq2sparse. However, I want to create
> these vectors based on the dictionary and idf values of a previously
> executed directory. Let's say that I created my vectors using the whole
> corpus and now I want to calculate new tfidf vectors for a few documents (or
> more exactly, a few queries) that share the properties of the previous
> corpus.
>
> I know that seq2sparse stores a dictionary and tf values in temporary
> folders. My first attempt was to modify DictionaryVectorizer and
> TFIDFConverter to have them use a dictionary and a df-count from a different
> directory. So far it seems that I had some luck with both, but now I'm
> getting an "index out of bounds" exception. My guess is that some other
> class or job determines the size of some array based on the document source.
>
> Do you guys have any ideas about what might be wrong? Or even better, do you
> guys know of a better way to generate a vector (i.e., a query vector) using
> previous matrix values (i.e., the index)?
>
> Thanks a lot,
>
> Julian
>
> P.S. The error I'm getting looks like this:
>
> Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local_0002
> org.apache.mahout.math.IndexException: Index 517 is outside allowable range of [0,0)
> at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:392)
> at org.apache.mahout.math.SequentialAccessSparseVector.<init>(SequentialAccessSparseVector.java:69)
> at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:95)
> at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.reduce(TFPartialVectorReducer.java:50)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> Apr 17, 2011 12:05:31 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
> INFO:  map 100% reduce 0%
>
