mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Problem using SNAPSHOT kmeans
Date Mon, 04 Jun 2012 22:07:57 GMT
Some things to try:
- Have you verified the contents of your input vectors actually have 
data in them?
* YES, from the other email you know that the data works fine in 0.6
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 
contents?
* YES, it is attached. The output is from trunk's clusterdump, run after 
kmeans failed, of course. Fortunately it is a simple data set.
- Is it possible to run the sequential version (-xm sequential)? If it 
is, you could run it in a debugger to gain more insight.
* YES, will report back.

On 6/4/12 2:19 PM, Jeff Eastman wrote:
> It looks like the probabilities vector returned by 
> AbstractClusteringPolicy.classify() has no non-zero elements. In this 
> case, AbstractClusteringPolicy.select()'s call to 
> AbstractVector.maxValueIndex() is returning -1 and that is causing the 
> exception.
>
> How could this happen? I'm not exactly sure, but consider that the 
> probabilities vector is calculated in 
> AbstractClusteringPolicy.classify() by calling 
> DistanceMeasureCluster.pdf() on each of the prior clusters in 
> b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't 
> see how this could ever return zero. Certainly, some of your vectors 
> will match the prior cluster centers exactly (they were sampled from 
> the input) and those values would return pdf==1. Even if the cosine 
> distance were 1, the pdf would be 0.5.
>
> Some things to try:
> - Have you verified the contents of your input vectors actually have 
> data in them?
> - Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 
> contents?
> - Is it possible to run the sequential version (-xm sequential)? If it 
> is, you could run it in a debugger to gain more insight.
>
> Jeff
>
> On 6/4/12 12:05 PM, Pat Ferrel wrote:
>> Running kmeans from the CLI on several trunk versions, I get an error I 
>> don't understand. When the job died, 
>> b3/canopy-centroids/clusters-0-final contained the random-seeds file 
>> generated by the kmeans driver, and b3/kmeans-clusters/clusters-0 
>> had several part files, but b3/kmeans-clusters/clusters-1 was empty. 
>> When I follow the stack trace through the code, it doesn't make much sense.
>>
>> Command line:
>> mahout kmeans
>>   -i b3/vectors/tfidf-vectors/
>>   -k 20
>>   -c b3/canopy-centroids/clusters-0-final
>>   -cl
>>   -o b3/kmeans-clusters
>>   -ow
>>   -cd 0.01
>>   -x 30
>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>
>> Error:
>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: 
>> {--clustering=null, 
>> --clusters=[b3/canopy-centroids/clusters-0-final], 
>> --convergenceDelta=[0.01], 
>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], 
>> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], 
>> --maxIter=[30], --method=[mapreduce], --numClusters=[20], 
>> --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], 
>> --tempDir=[temp]}
>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info 
>> from SCDynamicStore
>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting 
>> b3/canopy-centroids/clusters-0-final
>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load 
>> native-hadoop library for your platform... using builtin-java classes 
>> where applicable
>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors 
>> to b3/canopy-centroids/clusters-0-final/part-randomSeed
>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: 
>> b3/vectors/tfidf-vectors Clusters In: 
>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out: 
>> b3/kmeans-clusters Distance: 
>> org.apache.mahout.common.distance.CosineDistanceMeasure
>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max 
>> Iterations: 30 num Reduce Tasks: 
>> org.apache.mahout.math.VectorWritable Input Vectors: {}
>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
>> Cluster Iterator running iteration 1 over priorPath: 
>> b3/kmeans-clusters/clusters-0
>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to 
>> process : 1
>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>> org.apache.mahout.math.IndexException: Index -1 is outside allowable 
>> range of [0,20)
>>     at 
>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>     at 
>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>     at 
>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>     at 
>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
>> Exception in thread "main" java.lang.InterruptedException: Cluster 
>> Iteration 1 failed processing b3/kmeans-clusters/clusters-1
>>     at 
>> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>>     at 
>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>>     at 
>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>>     at 
>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at 
>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at 
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>     at 
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>
>>
>>
>>
>>
>
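
Jeff's diagnosis quoted above can be illustrated with a small standalone sketch. It is not Mahout source: the class name and method bodies are hypothetical, plain arrays stand in for Mahout vectors, and the pdf form 1 / (1 + distance) is an assumption chosen only because it agrees with the two figures in his message (pdf==1 for an exact match, 0.5 at distance 1). The exception thrown here is Java's ArrayIndexOutOfBoundsException rather than Mahout's IndexException.

```java
public class EmptyProbabilitiesSketch {

    // Assumed pdf form: 1 / (1 + distance). Not taken from Mahout source;
    // it merely matches "distance 0 gives 1" and "distance 1 gives 0.5".
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Index of the largest non-zero entry, or -1 when no entry is non-zero.
    // This mirrors the behaviour described in the thread, not the real code.
    static int maxValueIndex(double[] v) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0 && v[i] > max) {
                max = v[i];
                result = i;
            }
        }
        return result;
    }

    // Hypothetical stand-in for the select step: put weight 1.0 on the most
    // probable cluster. Fails when maxValueIndex returns -1.
    static double[] select(double[] probabilities) {
        int maxIndex = maxValueIndex(probabilities);
        double[] weights = new double[probabilities.length];
        weights[maxIndex] = 1.0; // index -1 throws ArrayIndexOutOfBoundsException
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(pdf(0.0)); // prints 1.0
        System.out.println(pdf(1.0)); // prints 0.5

        double[] normal = {0.2, 0.5, 0.3};
        System.out.println(maxValueIndex(normal)); // prints 1

        double[] allZero = new double[20];
        System.out.println(maxValueIndex(allZero)); // prints -1

        try {
            select(allZero);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("select failed on an all-zero probabilities vector");
        }
    }
}
```

Under these assumptions a pdf of this form never returns zero for a finite distance, which is why an all-zero probabilities vector is surprising and worth chasing in a debugger.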
