mahout-user mailing list archives

From Anna Lahoud <annalah...@gmail.com>
Subject RowSimilarity incorrectly setting 'size'
Date Fri, 07 Sep 2012 22:49:25 GMT
I am running a RowSimilarityJob with a large dataset. When I call
Vector.size() on the resulting vector, it always returns Integer.MAX_VALUE.
At first I thought maybe I really did end up with a cardinality that
exceeded the range of an int. Upon further checking, I found that the rowid
vector cardinality was correct. It is only the vectors after the RowId job
that have an invalid size.

I did some looking into the job's temp directory (which in my Mahout V0.6
still exists after the job). Both the cooccurrence and the weight outputs
are also set to size=Integer.MAX_VALUE.

In searching for the problem, I found that in the VectorNormMapper, which
belongs to the first of the three jobs that run, the vector is created with
the following line:

RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);

which sets the size of the vector to Integer.MAX_VALUE. I believe that
value is then carried through to the remaining vectors throughout the jobs.
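
A minimal sketch of the suspected behavior, assuming only that a sparse vector keeps its declared cardinality separately from the entries it stores. SparseVec, like() and effectiveCardinality() are hypothetical stand-ins written for this illustration, not Mahout's classes or API. It shows how a vector declared with Integer.MAX_VALUE would report that size no matter what it holds, how a vector derived from it would inherit the same value, and one possible way to recover a usable dimension from the stored indices:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseSizeDemo {

    /** Hypothetical stand-in for a sparse vector: declared cardinality plus stored entries. */
    static final class SparseVec {
        final int cardinality;
        final Map<Integer, Double> entries = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        void set(int index, double value) {
            entries.put(index, value);
        }

        /** Reports the declared cardinality, not the number of stored entries. */
        int size() {
            return cardinality;
        }

        /** Empty vector that inherits the declared cardinality of this one. */
        SparseVec like() {
            return new SparseVec(cardinality);
        }
    }

    /** Smallest cardinality that can hold every stored entry: highest index + 1. */
    static int effectiveCardinality(SparseVec v) {
        int maxIndex = -1;
        for (int index : v.entries.keySet()) {
            maxIndex = Math.max(maxIndex, index);
        }
        return maxIndex + 1;
    }

    public static void main(String[] args) {
        // Declared the same way as the line quoted above.
        SparseVec partial = new SparseVec(Integer.MAX_VALUE);
        partial.set(3, 1.0);
        partial.set(42, 2.5);

        // A vector derived from it inherits the oversized cardinality.
        SparseVec derived = partial.like();

        System.out.println("declared size:  " + partial.size());
        System.out.println("derived size:   " + derived.size());
        System.out.println("effective size: " + effectiveCardinality(partial));
    }
}
```

If this matches what the job does, then size() on the output says nothing about the data, and the highest stored index is the only reliable indication of the real dimension.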

I don't know if this is a known bug or not.

Thanks,

Anna
