mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Problem using SNAPSHOT kmeans
Date Mon, 04 Jun 2012 21:19:11 GMT
It looks like the probabilities vector returned by 
AbstractClusteringPolicy.classify() has no non-zero elements. In this 
case, AbstractClusteringPolicy.select()'s call to 
AbstractVector.maxValueIndex() returns -1, and passing that index to 
AbstractVector.set() causes the IndexException in your trace.

How could this happen? I'm not exactly sure, but consider that the 
probabilities vector is calculated in 
AbstractClusteringPolicy.classify() by calling 
DistanceMeasureCluster.pdf() on each of the prior clusters in 
b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see 
how this could ever return zero. Certainly, some of your vectors will 
match the prior cluster centers exactly (they were sampled from the 
input) and those vectors would return pdf==1. Even if the cosine distance 
were 1, the pdf would be 0.5.
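
To make the arithmetic above concrete, here is a minimal standalone sketch. It does not use Mahout's classes: the class PdfSketch and its methods are hypothetical, and the formula pdf = 1 / (1 + distance) is an assumption chosen because it matches the two values quoted above (pdf==1 at distance 0, pdf==0.5 at distance 1). It also shows one way, in this sketch only, that every pdf could fail to be positive: an all-zero input vector makes the cosine distance 0/0, so no index is selected.

```java
// Hypothetical standalone illustration; not Mahout source code.
final class PdfSketch {

    // Cosine distance: 1 - (a . b) / (|a| * |b|).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // If either vector is all zeros this is 0.0 / 0.0, which is NaN.
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed pdf form, consistent with pdf==1 at d=0 and pdf==0.5 at d=1.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Index of the largest strictly positive entry, or -1 if there is none.
    // NaN never compares greater than anything, so NaN entries are skipped.
    static int maxPositiveIndex(double[] values) {
        int best = -1;
        double max = 0.0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] center = {3.0, 4.0, 0.0};
        double[] orthogonal = {0.0, 0.0, 2.0};
        double[] empty = {0.0, 0.0, 0.0};

        System.out.println(pdf(cosineDistance(center, center)));     // 1.0
        System.out.println(pdf(cosineDistance(center, orthogonal))); // 0.5

        // An empty vector measured against two cluster centers.
        double[] probabilities = {
            pdf(cosineDistance(empty, center)),
            pdf(cosineDistance(empty, orthogonal))
        };
        System.out.println(probabilities[0]);                // NaN
        System.out.println(maxPositiveIndex(probabilities)); // -1
    }
}
```

Whether the snapshot's CosineDistanceMeasure behaves the same way on an empty vector is not confirmed here; the sketch only shows why checking the input vectors for data is a reasonable first step.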

Some things to try:
- Have you verified that your input vectors actually contain data?
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 
contents?
- Is it possible to run the sequential version (-xm sequential)? If 
so, you could run it in a debugger to gain more insight.

Jeff

On 6/4/12 12:05 PM, Pat Ferrel wrote:
> Using the kmeans CLI from several trunk versions, I get an error I 
> don't understand.  When the job died, 
> b3/canopy-centroids/clusters-0-final contained the random-seeds file 
> generated by the kmeans driver, and b3/kmeans-clusters/clusters-0 
> had several part files, but b3/kmeans-clusters/clusters-1 was empty. 
> When I look through the code from the trace, it doesn't make much sense.
>
> Command line:
> mahout kmeans
>   -i b3/vectors/tfidf-vectors/
>   -k 20
>   -c b3/canopy-centroids/clusters-0-final
>   -cl
>   -o b3/kmeans-clusters
>   -ow
>   -cd 0.01
>   -x 30
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>
> Error:
> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: 
> {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final], 
> --convergenceDelta=[0.01], 
> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], 
> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], 
> --maxIter=[30], --method=[mapreduce], --numClusters=[20], 
> --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], 
> --tempDir=[temp]}
> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info 
> from SCDynamicStore
> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting 
> b3/canopy-centroids/clusters-0-final
> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes 
> where applicable
> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to 
> b3/canopy-centroids/clusters-0-final/part-randomSeed
> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: 
> b3/vectors/tfidf-vectors Clusters In: 
> b3/canopy-centroids/clusters-0-final/part-randomSeed Out: 
> b3/kmeans-clusters Distance: 
> org.apache.mahout.common.distance.CosineDistanceMeasure
> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max 
> Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable 
> Input Vectors: {}
> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
> Cluster Iterator running iteration 1 over priorPath: 
> b3/kmeans-clusters/clusters-0
> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to 
> process : 1
> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
> org.apache.mahout.math.IndexException: Index -1 is outside allowable 
> range of [0,20)
>     at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>     at 
> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>     at 
> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>     at 
> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
> Exception in thread "main" java.lang.InterruptedException: Cluster 
> Iteration 1 failed processing b3/kmeans-clusters/clusters-1
>     at 
> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>     at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>     at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>     at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at 
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>
>
>
>
>

