mahout-user mailing list archives

From Drew Farris <d...@apache.org>
Subject Re: Problems running examples
Date Thu, 09 Jun 2011 15:30:22 GMT
Sean, I'd be surprised to find out that k-means was busted. It was
working just prior to release 0.5 when I was working on
https://issues.apache.org/jira/browse/MAHOUT-694 which may be related
to Mark's problems, but then again I haven't been tracking the other
patches that were applied around that timeframe or have been applied
since.

I can't really comment on Hector's issue without first determining why
RandomSeedGenerator.buildRandom would be attempting to cast
IntWritables to VectorWritables -- but I'll have to treat that as a
separate problem for now.

Mark, it seems that the core of your problem, as you state, is that
seqdirectory is expecting its input to be in HDFS and not on the local
disk. It seems that MAHOUT_LOCAL=true, which I added in MAHOUT-694 to
force seqdirectory to read the data prepared by ExtractReuters from
the local disk, isn't doing the right thing in your environment (or at
all, for that matter).

If you manually copy the output from ExtractReuters up to the
mahout-work/reuters-out directory on your HDFS and delete
mahout-work/reuters-out-seqdir from HDFS, and then run:

MAHOUT_LOCAL=true $MAHOUT seqdirectory \
       -i mahout-work/reuters-out \
       -o mahout-work/reuters-out-seqdir \
       -c UTF-8 -chunk 5

Does seqdirectory produce non-empty output in the reuters-out-seqdir
directory? It would also be useful to determine whether the bin/mahout
script is actually executing using a Java VM or using the hadoop
script, because that is likely a factor here.
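
For what it's worth, the staging and the check could be scripted roughly
as below. This is only a sketch: it assumes ExtractReuters wrote to
mahout-work/reuters-out on the local disk, the function names and the
HADOOP_CMD override are made up for illustration, and the size test is
crude because a sequence file with no records still has a header.

```shell
# HADOOP_CMD is a hypothetical override; set HADOOP_CMD=echo for a dry run.
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# Remove stale seqdirectory output from HDFS, then copy the local
# ExtractReuters output up (assumed local path: mahout-work/reuters-out).
stage_reuters_input() {
  "$HADOOP_CMD" fs -rmr mahout-work/reuters-out-seqdir || true
  "$HADOOP_CMD" fs -put mahout-work/reuters-out mahout-work/reuters-out
}

# Succeeds if the given local directory holds at least one non-empty
# file, ignoring the .crc checksum files Hadoop writes alongside output.
# For output on HDFS, inspect 'hadoop fs -ls mahout-work/reuters-out-seqdir'.
has_nonempty_output() {
  [ -n "$(find "$1" -type f -size +0 ! -name '*.crc' 2>/dev/null | head -n 1)" ]
}
```

After running seqdirectory, something like
'has_nonempty_output mahout-work/reuters-out-seqdir && echo ok' gives a
quick first answer for the local case.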

Lance pointed out a couple days ago that running bin/mahout as 'sh -x
bin/mahout' will cause it to emit debug info that indicates precisely
what the script is doing, so try:

MAHOUT_LOCAL=true sh -x ./bin/mahout

and

sh -x ./bin/mahout

If you are running on Hadoop, you will see a line like:

+ exec /opt/hadoop-0.20.2/bin/hadoop jar
/home/drew/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
org.apache.mahout.driver.MahoutDriver

Otherwise if it is running locally:

+ exec /opt/j2sdk/bin/java -Xmx1000m -Dhadoop.log.dir [..classpath..]
org.apache.mahout.driver.MahoutDriver
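
If you'd rather not eyeball the trace, a small filter along these lines
might help. It is a sketch only: classify_mahout_mode is a made-up name,
and it assumes the final exec line looks like one of the two examples
above. Note that 'sh -x' writes its trace to stderr, hence the 2>&1 in
the usage line.

```shell
# Hypothetical helper: reads 'sh -x' trace output on stdin and prints
# "hadoop", "local", or "unknown" based on the last '+ exec' line seen.
classify_mahout_mode() {
  exec_line=$(grep -E '^\++ exec ' | tail -n 1)
  case "$exec_line" in
    *"/hadoop jar"*) echo "hadoop" ;;   # launched through the hadoop script
    *"/java "*)      echo "local" ;;    # launched directly on a Java VM
    *)               echo "unknown" ;;
  esac
}

# Usage (not run here):
#   MAHOUT_LOCAL=true sh -x ./bin/mahout 2>&1 | classify_mahout_mode
```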

If you do discover you are running in local mode, try clearing out the
values of the HADOOP_CONF_DIR or HADOOP_HOME environment variables,
cleaning out the existing data, and re-running.
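
One way to clear those variables for a single run without touching your
login shell is a subshell wrapper. A sketch, with run_without_hadoop_env
being a name I made up:

```shell
# Hypothetical helper: runs the given command in a subshell with the
# Hadoop variables unset, so the calling shell keeps its own values.
run_without_hadoop_env() (
  unset HADOOP_HOME HADOOP_CONF_DIR
  "$@"
)

# Usage (not run here), re-running the seqdirectory step from above:
#   run_without_hadoop_env env MAHOUT_LOCAL=true ./bin/mahout seqdirectory \
#       -i mahout-work/reuters-out \
#       -o mahout-work/reuters-out-seqdir \
#       -c UTF-8 -chunk 5
```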

I'll spend some time trying things out and let you know what I run into as well.

Drew

On Thu, Jun 9, 2011 at 10:40 AM, Mark <static.void.dev@gmail.com> wrote:
> Forgot to mention... great book :)
>
> On 6/9/11 7:30 AM, Mark wrote:
>>
>> KMeans is busted? What do you mean by this? The algorithm simply won't
>> work or just the reuters example?
>>
>> Thanks
>>
>> On 6/9/11 12:28 AM, Sean Owen wrote:
>>>
>>> (Assuming you are on HEAD,) I think KMeans is busted -- this has come up
>>> before. I don't know if it is being maintained.  Anyone who's willing to
>>> step up and fix it is also welcome to overhaul it IMHO.
>>>
>>> On Thu, Jun 9, 2011 at 12:03 AM, Hector Yee<hector.yee@gmail.com>  wrote:
>>>
>>>> I got a slightly different error on the next line of KMeansDriver.java
>>>> (running on OS X Snow Leopard)
>>>>
>>>> 11/06/08 16:02:12 INFO compress.CodecPool: Got brand-new compressor
>>>> Exception in thread "main" java.lang.ClassCastException:
>>>> org.apache.hadoop.io.IntWritable cannot be cast to
>>>> org.apache.mahout.math.VectorWritable
>>>>  at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:90)
>>>>  at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:102)
>>>>
>>>>
>>>> On Sun, Jun 5, 2011 at 9:31 PM, Jeff Eastman<jeastman@narus.com>  wrote:
>>>>
>>>>> IIRC, Reuters used to run on a cluster but no longer does due to some
>>>>> obscure Lucene changes. In 0.5 it only works in local mode. I really
>>>>> hope
>>>>> this can be repaired by 0.6 as Reuters is a key entry point into Mahout
>>>>> clustering for many users.
>>>>>
>
