mahout-user mailing list archives

From Drew Farris <drew.far...@gmail.com>
Subject Re: Array out of bounds in the KMeans driver
Date Sun, 20 Dec 2009 01:43:57 GMT
Does the IndexFiles class store term vectors for the contents field?
If not, that could be the problem.
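
For context, a term vector is the per-document list of a field's terms
and their frequencies. The sketch below is a plain-Java illustration of
that idea only. It does not call Lucene or Mahout, and the names
TermVectorSketch and termFrequencies are hypothetical. The assumption
behind the question above is that the Lucene-to-Mahout driver needs this
per-document view, which an index only keeps for fields indexed with
term vectors enabled.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {
    // Hypothetical helper, not part of Lucene or Mahout: lowercases the text,
    // splits on anything that is not a Unicode letter or digit, and counts terms.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // Prints each distinct term with its count for one toy "document".
        System.out.println(termFrequencies("The cat saw the other cat"));
    }
}
```

If the contents field has no data like this stored per document, a
converter has nothing to turn into a vector, which would be consistent
with a tiny output file.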

Also, you can try dumping the vector file using
o.a.m.utils.vectors.VectorDumper in mahout-utils and taking a look to
see what's in there.

Failing that, in mahout-examples, you can run ./bin/build-reuters.sh
-- that will generate a known-good set of vectors, and you can try
running clustering on that. There's no need to let build-reuters.sh run
to completion: watch stdout and kill it once the vectors are done,
because it will then start running LDA, which you're not really
interested in at this point. Once this has run, the vectors themselves
can be found in work/vectors and the dictionary in work/dict.txt
(relative to the mahout-examples directory).

On Sat, Dec 19, 2009 at 7:41 PM, Benson Margulies <bimargulies@gmail.com> wrote:
> So,
>
> I took the stock Lucene 'IndexFiles' class. I modified it to read
> UTF-8. I ran it.
>
> I ran the following:
>
> java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver --dir
> he_lucene_index \
>   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
>   --idField path
>
> and am rewarded with a tiny file of vectors. Clearly I'm messing something up.
>
