mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Problem using SNAPSHOT kmeans
Date Wed, 06 Jun 2012 13:53:25 GMT
Yes, it looks like the input vectors are empty and this is the source of 
the error. I'm troubled, however, that empty vectors can have this 
impact on k-means. I'm going to write a unit test to see if I can 
duplicate this exception.
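
For reference, here is a rough standalone sketch of how an empty vector 
could lead to the -1 index seen in the trace. It does not use Mahout's 
classes: EmptyVectorSketch, its map-based vectors, and the three helper 
methods are hypothetical stand-ins. The pdf of 1/(1+distance) is assumed 
only because it agrees with the values mentioned in the thread (distance 
0 gives 1, distance 1 gives 0.5), and the argmax is assumed to start at 
-1 and use a strict greater-than comparison.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the classify/select path; not Mahout code.
class EmptyVectorSketch {

    // Naive cosine distance over sparse vectors stored as index -> weight.
    // An empty vector has norm 0, so the division below is 0/0 = NaN.
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed pdf: 1 at distance 0, 0.5 at distance 1. NaN in, NaN out.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Argmax that starts at -1. Every comparison against NaN is false,
    // so an all-NaN input leaves the result at -1.
    static int maxValueIndex(double[] values) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One made-up cluster center and one empty input vector.
        Map<Integer, Double> center = new TreeMap<>();
        center.put(701, 0.5);
        center.put(1876, 0.6);
        Map<Integer, Double> empty = new TreeMap<>();

        double[] probabilities = new double[3];
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] = pdf(cosineDistance(empty, center));
        }
        // Prints -1, the index that the real code then rejects.
        System.out.println(maxValueIndex(probabilities));
    }
}
```

If this is what is happening, either filtering empty vectors out of the 
input or guarding the zero-norm case in the distance computation would 
avoid the NaN values. Both are suggestions only, not a description of 
what trunk does.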

On 6/5/12 3:12 PM, Pat Ferrel wrote:
> I think I found the root but not sure what needs fixing.
>
> I took out n-gram generation and the vector now looks like this:
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value: 
> https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
>
> This works in clustering.
>
> It doesn't seem like a malformed vector should crash clustering (it 
> apparently doesn't in mahout 0.6) but it looks like something in 
> seq2sparse's n-gram weighting does cause a malformed vector.
>
> I'll file a JIRA
>
> On 6/5/12 11:48 AM, Pat Ferrel wrote:
>> Using seqdumper on the TFIDF vectors, that vector is indeed in the list
>> Key: https://farfetchers.com/category/collections/source/brice-berard:
>> Value: 
>> https://farfetchers.com/category/collections/source/brice-berard:{
>>
>> Looking in the seqfiles we find the document in part-00005 of 10 in 
>> no particular part of the file.
>> Key: https://farfetchers.com/category/collections/source/brice-berard:
>> Value: ::Title::
>> Brice Berard | FarFetchers.com
>> Blog Posts
>>
>> On the chance that this originates in seq2sparse I'll try changing 
>> options until the vector looks different, and then try clustering again.
>>
>> On 6/5/12 10:43 AM, Pat Ferrel wrote:
>>> I'm not completely sure what I'm looking at but...
>>>
>>> In iterateSeq on iteration #1  of processing vectors/tfidf-vectors 
>>> it reads
>>> vector = 
>>> "https://farfetchers.com/category/collections/source/brice-berard:{"
>>>
>>> It's a named vector where the URL is the name and the value is "{", 
>>> which looks wrong. When that is classified to get a probability 
>>> it gets
>>>
>>> probabilities = 
>>> "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
>>>
>>> That causes probabilities.maxValueIndex() to return -1 and everything dies.
>>>
>>> vector looks wrong, doesn't it? Truncated?
>>>
>>> I went back to try the same on mahout 0.6 but iterateSeq does not 
>>> get called though I used -xm sequential on both runs. I can't see 
>>> kmeans-clusters/clusters-0 being created on mahout 0.6 either. Is 
>>> that part of the refactoring?
>>>
>>> On 6/4/12 3:07 PM, Pat Ferrel wrote:
>>>> Some things to try:
>>>> - Have you verified the contents of your input vectors actually 
>>>> have data in them?
>>>> * YES, from the other email you know that the data works fine in 0.6
>>>> - Can you run the cluster dumper on the 
>>>> b3/kmeans-clusters/clusters-0 contents?
>>>> * YES, It is attached from trunk's clusterdump after the failure of 
>>>> kmeans, of course. A simple data set fortunately.
>>>> - Is it possible to run the sequential version (-xm sequential)? If 
>>>> it is you could run it in a debugger to gain more insight.
>>>> * YES, will report back.
>>>>
>>>> On 6/4/12 2:19 PM, Jeff Eastman wrote:
>>>>> It looks like the probabilities vector returned by 
>>>>> AbstractClusteringPolicy.classify() has no non-zero elements. In 
>>>>> this case, AbstractClusteringPolicy.select()'s call to 
>>>>> AbstractVector.maxValueIndex() is returning -1 and that is causing 
>>>>> the exception.
>>>>>
>>>>> How could this happen? I'm not exactly sure, but consider that the 
>>>>> probabilities vector is calculated in 
>>>>> AbstractClusteringPolicy.classify() by calling 
>>>>> DistanceMeasureCluster.pdf() on each of the prior clusters in 
>>>>> b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I 
>>>>> don't see how this could ever return zero. Certainly, some of your 
>>>>> vectors will match the prior cluster centers exactly (they were 
>>>>> sampled from the input) and those values would return pdf==1. Even 
>>>>> if the cosine distance was 1 the pdf would be 0.5.
>>>>>
>>>>> Some things to try:
>>>>> - Have you verified the contents of your input vectors actually 
>>>>> have data in them?
>>>>> - Can you run the cluster dumper on the 
>>>>> b3/kmeans-clusters/clusters-0 contents?
>>>>> - Is it possible to run the sequential version (-xm sequential)? 
>>>>> If it is you could run it in a debugger to gain more insight.
>>>>>
>>>>> Jeff
>>>>>
>>>>> On 6/4/12 12:05 PM, Pat Ferrel wrote:
>>>>>> Using the CLI to kmeans from several trunk versions I get an 
>>>>>> error I don't understand.  When the job died the 
>>>>>> b3/canopy-centroids/clusters-0-final contained the random-seeds 
>>>>>> file generated by the kmeans driver and the 
>>>>>> b3/kmeans-clusters/clusters-0 had several part files but 
>>>>>> b3/kmeans-clusters/clusters-1 was empty. When I look through the
>>>>>> code from the trace it doesn't make much sense.
>>>>>>
>>>>>> Command line:
>>>>>> mahout kmeans
>>>>>>   -i b3/vectors/tfidf-vectors/
>>>>>>   -k 20
>>>>>>   -c b3/canopy-centroids/clusters-0-final
>>>>>>   -cl
>>>>>>   -o b3/kmeans-clusters
>>>>>>   -ow
>>>>>>   -cd 0.01
>>>>>>   -x 30
>>>>>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>>
>>>>>> Error:
>>>>>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line 
>>>>>> arguments: {--clustering=null, 
>>>>>> --clusters=[b3/canopy-centroids/clusters-0-final], 
>>>>>> --convergenceDelta=[0.01], 
>>>>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>>>>>> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], 
>>>>>> --maxIter=[30], --method=[mapreduce], --numClusters=[20], 
>>>>>> --output=[b3/kmeans-clusters], --overwrite=null, 
>>>>>> --startPhase=[0], --tempDir=[temp]}
>>>>>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm 
>>>>>> info from SCDynamicStore
>>>>>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting 
>>>>>> b3/canopy-centroids/clusters-0-final
>>>>>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load 
>>>>>> native-hadoop library for your platform... using builtin-java 
>>>>>> classes where applicable
>>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>>>>>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 
>>>>>> vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
>>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: 
>>>>>> b3/vectors/tfidf-vectors Clusters In: 
>>>>>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out: 
>>>>>> b3/kmeans-clusters Distance: 
>>>>>> org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
>>>>>> Iterations: 30 num Reduce Tasks: 
>>>>>> org.apache.mahout.math.VectorWritable Input Vectors: {}
>>>>>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new 
>>>>>> decompressor
>>>>>> Cluster Iterator running iteration 1 over priorPath: 
>>>>>> b3/kmeans-clusters/clusters-0
>>>>>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths 
>>>>>> to process : 1
>>>>>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
>>>>>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 
>>>>>> 79691776/99614720
>>>>>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
>>>>>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
>>>>>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>>>>>> org.apache.mahout.math.IndexException: Index -1 is outside 
>>>>>> allowable range of [0,20)
>>>>>>     at 
>>>>>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>>>     at 
>>>>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>     at 
>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>
>>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Job complete: 
>>>>>> job_local_0001
>>>>>> 12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
>>>>>> Exception in thread "main" java.lang.InterruptedException: 
>>>>>> Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>     at 
>>>>>> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at 
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>     at 
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>     at 
>>>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>     at 
>>>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>     at 
>>>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>
>

