Running seqdumper on the TFIDF vectors shows that the vector is indeed in the list:
Key: https://farfetchers.com/category/collections/source/briceberard:
Value: https://farfetchers.com/category/collections/source/briceberard:{
Looking in the seqfiles, we find the document in part00005 of 10, at no
particular position in the file:
Key: https://farfetchers.com/category/collections/source/briceberard:
Value: ::Title::
Brice Berard  FarFetchers.com
Blog Posts
On the chance that this originates in seq2sparse, I'll try changing
options until the vector looks different, then try clustering again.
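
One way the "{" vector could lead to the NaN probabilities and the failing index is sketched below. It is a standalone illustration, not Mahout's code. It assumes the pdf is 1 / (1 + distance), which is consistent with the 0.5 quoted below for a cosine distance of 1, that the cosine distance divides by the product of the norms with no guard for a zero norm, and that the arg-max starts at -1 and only advances on a strict greater-than comparison. The class and method names are made up for the example.

```java
import java.util.Arrays;

final class EmptyVectorTrace {

    // Cosine distance as 1 - cos(theta). For an all-zero (empty) vector the
    // norm product is 0, so the division is 0.0 / 0.0, which is NaN.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Assumed pdf: distance 0 gives 1.0, distance 1 gives 0.5, NaN stays NaN.
    static double pdf(double[] point, double[] center) {
        return 1.0 / (1.0 + cosineDistance(point, center));
    }

    // Arg-max that starts at -1. Every comparison against NaN is false,
    // so an all-NaN input never moves the result off -1.
    static int maxValueIndex(double[] values) {
        int best = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical cluster centers and an empty document vector.
        double[][] centers = { {1.0, 0.0, 2.0}, {0.0, 3.0, 1.0} };
        double[] empty = {0.0, 0.0, 0.0};

        double[] probabilities = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            probabilities[i] = pdf(empty, centers[i]);
        }
        System.out.println(Arrays.toString(probabilities)); // [NaN, NaN]
        System.out.println(maxValueIndex(probabilities));   // -1
    }
}
```

If those assumptions hold, any document that vectorizes to nothing would produce the same trace, so filtering empty vectors before clustering (or fixing whatever empties them in seq2sparse) would avoid the failure.
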
On 6/5/12 10:43 AM, Pat Ferrel wrote:
> I'm not completely sure what I'm looking at but...
>
> In iterateSeq on iteration #1 of processing vectors/tfidfvectors it
> reads
> vector =
> "https://farfetchers.com/category/collections/source/briceberard:{"
>
> It's a named vector where the URL is the name and the value is "{",
> which looks wrong. When that is classified to get a probability, it
> gets
>
> probabilities =
> "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
>
> That causes probabilities.maxValueIndex() to return -1 and everything dies.
>
> vector looks wrong, doesn't it? Truncated?
>
> I went back to try the same on Mahout 0.6, but iterateSeq does not get
> called even though I used -xm sequential on both runs. I can't see
> kmeansclusters/clusters0 being created on Mahout 0.6 either. Is that
> part of the refactoring?
>
> On 6/4/12 3:07 PM, Pat Ferrel wrote:
>> Some things to try:
>> - Have you verified the contents of your input vectors actually have
>> data in them?
>> * YES, from the other email you know that the data works fine in 0.6
>> - Can you run the cluster dumper on the b3/kmeansclusters/clusters0
>> contents?
>> * YES, it is attached from trunk's clusterdump after the failure of
>> kmeans, of course. A simple data set fortunately.
>> - Is it possible to run the sequential version (-xm sequential)? If
>> it is you could run it in a debugger to gain more insight.
>> * YES, will report back.
>>
>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
>>> It looks like the probabilities vector returned by
>>> AbstractClusteringPolicy.classify() has no nonzero elements. In
>>> this case, AbstractClusteringPolicy.select()'s call to
>>> AbstractVector.maxValueIndex() is returning -1 and that is causing
>>> the exception.
>>>
>>> How could this happen? I'm not exactly sure, but consider that the
>>> probabilities vector is calculated in
>>> AbstractClusteringPolicy.classify() by calling
>>> DistanceMeasureCluster.pdf() on each of the prior clusters in
>>> b3/kmeansclusters/clusters0. With a CosineDistanceMeasure I don't
>>> see how this could ever return zero. Certainly, some of your vectors
>>> will match the prior cluster centers exactly (they were sampled from
>>> the input) and those values would return pdf==1. Even if the cosine
>>> distance was 1 the pdf would be 0.5.
>>>
>>> Some things to try:
>>> - Have you verified the contents of your input vectors actually have
>>> data in them?
>>> - Can you run the cluster dumper on the
>>> b3/kmeansclusters/clusters0 contents?
>>> - Is it possible to run the sequential version (-xm sequential)? If
>>> it is you could run it in a debugger to gain more insight.
>>>
>>> Jeff
>>>
>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
>>>> Using the CLI to kmeans from several trunk versions I get an error
>>>> I don't understand. When the job died the
>>>> b3/canopycentroids/clusters0final contained the randomseeds
>>>> file generated by the kmeans driver and the
>>>> b3/kmeansclusters/clusters0 had several part files but
>>>> b3/kmeansclusters/clusters1 was empty. When I look through the
>>>> code from the trace it doesn't make much sense.
>>>>
>>>> Command line:
>>>> mahout kmeans
>>>> -i b3/vectors/tfidfvectors/
>>>> -k 20
>>>> -c b3/canopycentroids/clusters0final
>>>> -cl
>>>> -o b3/kmeansclusters
>>>> -ow
>>>> -cd 0.01
>>>> -x 30
>>>> -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>
>>>> Error:
>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments:
>>>> {--clustering=null,
>>>> --clusters=[b3/canopycentroids/clusters0final],
>>>> --convergenceDelta=[0.01],
>>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>>>> --endPhase=[2147483647], --input=[b3/vectors/tfidfvectors/],
>>>> --maxIter=[30], --method=[mapreduce], --numClusters=[20],
>>>> --output=[b3/kmeansclusters], --overwrite=null, --startPhase=[0],
>>>> --tempDir=[temp]}
>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info
>>>> from SCDynamicStore
>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting
>>>> b3/canopycentroids/clusters0final
>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
>>>> native-hadoop library for your platform... using builtin-java
>>>> classes where applicable
>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors
>>>> to b3/canopycentroids/clusters0final/partrandomSeed
>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input:
>>>> b3/vectors/tfidfvectors Clusters In:
>>>> b3/canopycentroids/clusters0final/partrandomSeed Out:
>>>> b3/kmeansclusters Distance:
>>>> org.apache.mahout.common.distance.CosineDistanceMeasure
>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
>>>> Iterations: 30 num Reduce Tasks:
>>>> org.apache.mahout.math.VectorWritable Input Vectors: {}
>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
>>>> Cluster Iterator running iteration 1 over priorPath:
>>>> b3/kmeansclusters/clusters0
>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to
>>>> process : 1
>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
>>>> 12/06/04 07:55:08 INFO mapred.JobClient: map 0% reduce 0%
>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>>>> org.apache.mahout.math.IndexException: Index -1 is outside
>>>> allowable range of [0,20)
>>>> at
>>>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>>> at
>>>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>>> at
>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>>> at
>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>> at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>
>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
>>>> Exception in thread "main" java.lang.InterruptedException: Cluster
>>>> Iteration 1 failed processing b3/kmeansclusters/clusters1
>>>> at
>>>> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>>>> at
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>>>> at
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>>>> at
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at
>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at
>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>> at
>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>> at
>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>
