mahout-user mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Array out of bounds in the KMeans driver
Date Sun, 20 Dec 2009 02:42:55 GMT
It didn't have term vectors.

On Sat, Dec 19, 2009 at 8:43 PM, Drew Farris <drew.farris@gmail.com> wrote:
> Does the IndexFiles class store term vectors for the contents field?
> If not, that could be the problem.
>
> Also, you can try dumping the vector file using
> o.a.m.utils.vectors.VectorDumper in mahout-utils and taking a look to
> see what's in there.
>
> Failing that, in mahout-examples, you can run ./bin/build-reuters.sh
> -- that will generate a known good set of vectors and you can try
> running clustering upon that. There is no need to let build-reuters.sh
> complete: watch stdout and kill it once the vectors are done, because
> it will then start running LDA, which you're not really interested in
> at this point. Once this has run, the vectors themselves can be found
> in work/vectors and the dictionary in work/dict.txt (relative to the
> mahout-examples directory).
>
> On Sat, Dec 19, 2009 at 7:41 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>> So,
>>
>> I took the stock Lucene 'IndexFiles' class. I modified it to read
>> UTF-8. I ran it.
>>
>> I ran the following:
>>
>> java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver \
>>   --dir he_lucene_index \
>>   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
>>   --idField path
>>
>> and am rewarded with a tiny file of vectors. Clearly I'm messing something up.
>>
>
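For readers landing on this thread: the diagnosis above was that the "contents" field had been indexed without term vectors. A term vector is the per-document map of terms to their frequencies. In Lucene releases of that era it was typically requested by passing Field.TermVector.YES when constructing the field. The plain-Java sketch below only illustrates the idea and uses neither Lucene nor Mahout. The class TermVectorSketch and the methods termVector and export are hypothetical names, and the "skip documents with no stored vector" behaviour is an assumption that mimics the symptom reported here, not Mahout's actual logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Lucene or Mahout code.
public class TermVectorSketch {

    // Build a term -> frequency map for one document's text.
    // Tokens are maximal runs of Unicode letters, lower-cased.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tv = new TreeMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (tok.isEmpty()) {
                continue; // split() can yield empty strings at the edges
            }
            tv.merge(tok, 1, Integer::sum);
        }
        return tv;
    }

    // Assumed behaviour: a document whose stored term vector is missing
    // (null) or empty contributes nothing to the exported vector file.
    static List<Map<String, Integer>> export(List<Map<String, Integer>> stored) {
        List<Map<String, Integer>> out = new ArrayList<>();
        for (Map<String, Integer> tv : stored) {
            if (tv != null && !tv.isEmpty()) {
                out.add(tv);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> withVector = termVector("the cat saw the dog");
        Map<String, Integer> withoutVector = null; // field indexed without term vectors
        System.out.println(withVector);
        System.out.println(export(Arrays.asList(withVector, withoutVector)).size());
    }
}
```

If every document in an index looks like withoutVector, an export of this kind has almost nothing to write, which is consistent with the "tiny file of vectors" described in the original message.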
