mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: RowSimilarity incorrectly setting 'size'
Date Sat, 08 Sep 2012 03:08:14 GMT
Yes, the indices of the vector are item IDs, which in theory can take
on any value. In practice that's limited to the range of longs in
Java. And more practically still, that's limited to the range of
nonnegative ints, since they are used as Vector indices. This is why
the dimensionality is conceptually infinite, and set to
Integer.MAX_VALUE in practice.
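
To illustrate the distinction between declared dimensionality and the number of entries actually stored, here is a minimal sketch. It does not use Mahout: SparseVectorSketch and all of its methods are hypothetical stand-ins written for this example, assuming only that a sparse vector keeps a fixed cardinality alongside a map of nonzero entries. In Mahout itself, the count of stored entries should be available through Vector.getNumNondefaultElements(), but check that against the version you are running.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a sparse vector. This is NOT Mahout's
// RandomAccessSparseVector; it only mimics the idea of a declared
// cardinality that is independent of how many entries are stored.
class SparseVectorSketch {
    private final int cardinality;                          // declared dimensionality
    private final Map<Integer, Double> entries = new HashMap<>();

    SparseVectorSketch(int cardinality) {
        if (cardinality < 0) {
            throw new IllegalArgumentException("cardinality must be nonnegative");
        }
        this.cardinality = cardinality;
    }

    // Store a value at an index; zeros are not stored at all.
    void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
    }

    double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    // Declared dimensionality, regardless of what is stored.
    int size() {
        return cardinality;
    }

    // Number of entries actually stored.
    int numNonDefaultElements() {
        return entries.size();
    }

    // Smallest dimension that would still hold every stored entry.
    int tightDimension() {
        int max = -1;
        for (int index : entries.keySet()) {
            max = Math.max(max, index);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        // Item IDs are used directly as indices, so any nonnegative int is legal.
        SparseVectorSketch v = new SparseVectorSketch(Integer.MAX_VALUE);
        v.set(7, 0.5);
        v.set(1_000_000, 0.25);
        v.set(2_000_000_000, 0.125);

        System.out.println("size()                  = " + v.size());
        System.out.println("numNonDefaultElements() = " + v.numNonDefaultElements());
        System.out.println("tightDimension()        = " + v.tightDimension());
    }
}
```

Under these assumptions, size() reports the declared cardinality no matter how few entries exist, so code that needs the real extent of the data has to count stored entries or scan for the largest index, as tightDimension() does here.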

On Fri, Sep 7, 2012 at 11:49 PM, Anna Lahoud <annalahoud@gmail.com> wrote:
> I am running a RowSimilarityJob with a large dataset. When I call
> Vector.size() on the resulting vector, it always returns Integer.MAX_VALUE.
> At first I thought maybe I really did end up with a cardinality that
> outsized the int. Upon further checking, I found that the rowid vector
> cardinality was correct. It is only the vectors after the RowId job that
> have an invalid size.
>
> I did some looking into the job's temp directory (which in my Mahout V0.6
> still exists after the job). Both the cooccurrence and the weight outputs
> are also set to size=Integer.MAX_VALUE.
>
> In searching for the problem, I found that in the VectorNormMapper, which is
> the first of three jobs that run, the vector is created with the following
> line:
>
> RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
>
> which sets the size of the vector to Integer.MAX_VALUE. I believe that
> is then carried through to the remaining vectors throughout the jobs.
>
> I don't know if this is a known bug or not.
>
> Thanks,
>
> Anna
