mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Transforming data for k-means analysis
Date Wed, 15 Sep 2010 00:04:51 GMT
  Hmm, this may also be caused simply by using the wrong input path to 
kmeans. It should be the directory containing your input vector sequence 
files, not the name of the file itself. That code in RandomSeedGenerator 
is the first place in kmeans where the input path is queried. Judging by 
the fact that the index exception was for index 0, I don't think it found 
any of your sequence files.
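The directory-vs-file distinction above can be checked before launching the job. A minimal sketch in plain java.nio (a stand-in only; Mahout itself resolves paths through Hadoop's FileSystem, and the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class InputPathCheck {
  // Returns true only if the path is a directory containing at least one
  // entry -- the shape the kmeans -i argument expects.
  static boolean looksLikeVectorDir(Path input) {
    if (!Files.isDirectory(input)) {
      return false; // a single sequence file was passed, not its directory
    }
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(input)) {
      return entries.iterator().hasNext(); // empty dir: nothing to cluster
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The current directory exists and is non-empty; a made-up name is not.
    assert looksLikeVectorDir(Paths.get("."));
    assert !looksLikeVectorDir(Paths.get("no-such-dir-12345"));
  }
}
```

If this check fails on the path you are handing to kmeans, the seed generator will scan zero sequence files and fail exactly as reported.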

We've had similar reports of this nature before. I'll see if I can 
create a unit test to reproduce your problem and then fix it.

On 9/14/10 4:43 PM, Jeff Eastman wrote:
>  Hi Radek,
>
> Looking over your mapper code, it looks mostly OK, but I am curious: 
> why are you writing the vector size in the context write's Text 
> argument? Aren't the vectors all the same size? The Mahout document 
> processing jobs generally put the document ID in the key slot (and 
> also sometimes in a NamedVector in the value slot). If you look at 
> line 107 in the file, however, you will see that the exception is 
> likely the result of "chosenTexts.get(i)". Looking up at the reader 
> loop above it, it is scanning through all the input vectors, so the 
> only way I can see a bounds exception is if "k" is greater than the 
> number of input vectors. What value did you specify in mahout kmeans?
>
> Have you tried running this in a debugger? Maybe a simple test to 
> check whether k > chosenTexts.size() (and also chosenClusters.size())? 
> Probably by now you've found the cause on your own...
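The bounds reasoning above can be reproduced in isolation. A sketch in plain Java (the `seedsSafe` helper is hypothetical, not Mahout API; it just expresses the suggested guard), showing that indexing an empty list yields exactly the "Index: 0, Size: 0" failure from the stack trace:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeedBounds {
  // Suggested guard: writing k seed clusters is only safe when k does not
  // exceed the number of candidate seeds actually collected from the input.
  static boolean seedsSafe(List<String> chosenTexts, int k) {
    return k <= chosenTexts.size();
  }

  public static void main(String[] args) {
    List<String> chosenTexts = new ArrayList<>(); // no input vectors found
    boolean threw = false;
    try {
      chosenTexts.get(0); // what the seed generator effectively attempts
    } catch (IndexOutOfBoundsException e) {
      threw = true; // "Index: 0, Size: 0", matching the reported trace
    }
    assert threw;
    assert !seedsSafe(chosenTexts, 600);      // k = 600, nothing collected
    assert seedsSafe(Arrays.asList("a", "b", "c"), 3);
  }
}
```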
>
>
> On 9/14/10 3:45 PM, Radek Maciaszek wrote:
>> Hi Jeff,
>>
>> Thanks again for your help; I am starting to see the light at the end 
>> of my MSc, eventually. I think I broke something in Mahout again ;) 
>> Due to the number of dimensions (around 14,000) I need to use sparse 
>> vectors (which are wrapped inside NamedVectors). I used the logic from 
>> the syntheticdata InputMapper, and the sequence files appear to be 
>> generated correctly - well, at least I cannot see any errors in that 
>> process. However, as soon as I try to pass that data to k-means 
>> clustering, the RandomSeedGenerator class gives me the following error:
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>          at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>          at java.util.ArrayList.get(ArrayList.java:322)
>>          at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>
>> It appears that the following line generates the IndexOutOfBoundsException:
>> writer.append(chosenTexts.get(i), chosenClusters.get(i));
>>
>> I believe this is probably because the sequence files are not created 
>> correctly, or perhaps the code in RandomSeedGenerator is not compatible 
>> with the sequence files which I used... Here is the essence of the code 
>> I am using to create the sequence file, namely the map() method from 
>> InputMapper:
>>
>>   protected void map(LongWritable key, Text values, Context context)
>>       throws IOException, InterruptedException {
>>     String[] numbers = InputMapper.SPACE.split(values.toString());
>>     SequentialAccessSparseVector sparseVector = null;
>>     String keyName = "";
>>     int vectorSize = -1;
>>     for (String value : numbers) {
>>       if (keyName.equals("")) {
>>         keyName = value;                 // first token: the document ID
>>       } else if (vectorSize == -1) {
>>         vectorSize = Integer.parseInt(value);   // second token: dimensions
>>         sparseVector = new SequentialAccessSparseVector(vectorSize);
>>       } else if (value.length() > 0) {
>>         String[] valuePair = InputMapper.COLON.split(value);
>>         if (!valuePair[1].equals("NULL")) {
>>           sparseVector.setQuick(Integer.parseInt(valuePair[0]),
>>               Double.valueOf(valuePair[1]));
>>         }
>>       }
>>     }
>>     try {
>>       Vector result = new NamedVector(sparseVector, keyName);
>>       VectorWritable vectorWritable = new VectorWritable(result);
>>       context.write(new Text(String.valueOf(vectorSize)), vectorWritable);
>>     } catch (Exception e) {
>>       throw new IllegalStateException(e);
>>     }
>>   }
>>
>> My input data, which the mapper analyzes, is in the format:
>> ID NumberOfDimensionsInVector Index1:Value1 Index5:Value5 IndexY:ValueY...
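The format described above (ID, then dimension count, then sparse index:value pairs with "NULL" entries skipped) can be exercised without Hadoop. A plain-Java sketch of the same parse, using a Map in place of the sparse vector (class and method names here are illustrative, not from Mahout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineParse {
  // Parses one input line "ID N idx:val idx:val ..." into a sparse map of
  // index -> value, skipping pairs whose value is the literal "NULL".
  static Map<Integer, Double> parse(String line) {
    String[] tokens = line.split("\\s+");
    // tokens[0] is the document ID, tokens[1] the vector size; pairs follow.
    Map<Integer, Double> cells = new LinkedHashMap<>();
    for (int i = 2; i < tokens.length; i++) {
      String[] pair = tokens[i].split(":");
      if (!"NULL".equals(pair[1])) {
        cells.put(Integer.parseInt(pair[0]), Double.valueOf(pair[1]));
      }
    }
    return cells;
  }

  public static void main(String[] args) {
    Map<Integer, Double> v = parse("doc42 14000 1:0.5 5:NULL 7:2.0");
    assert v.size() == 2;        // the NULL cell was skipped
    assert v.get(1) == 0.5;
    assert v.get(7) == 2.0;
    assert !v.containsKey(5);
  }
}
```

Feeding sample lines through a standalone parse like this is a quick way to confirm the mapper logic independently of the sequence-file plumbing.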
>>
>> I am still trying to get my head around the implementation details of 
>> Mahout
>> and I find it a bit difficult to debug some things. Thank you in 
>> advance for
>> any tips.
>>
>> Best,
>> Radek
>>
>> On 9 September 2010 16:26, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>>
>>>  It's alive, how marvelous! On the number of clusters, I am 
>>> uncertain. You may indeed have uncovered a <choke> defect. Let's work 
>>> on characterizing that a bit more. I gather that you ran the mahout 
>>> kmeans command with -k 600 and only found 175 clusterIds referenced 
>>> in the clusteredPoints directory. How many clusters were in your -c 
>>> directory? Those would be the initial clusters produced by the 
>>> RandomSeedGenerator. Try running the cluster dumper on that 
>>> directory. If there are still only 175 clusters, then the generator 
>>> has a problem.
>>>
>>> Canopy is a little hard to parameterize. If you are only getting a 
>>> single cluster out, then the T2 distance you are using is too large. 
>>> Try a smaller value and the number of clusters should increase 
>>> dramatically at some point (in the limit, to the number of vectors if 
>>> T2=0). I use a binary search to converge on this value. T1 is less 
>>> fussy and needs only to be larger than T2. It influences the number 
>>> of nearby points that are not within T2 but still need to contribute 
>>> to the cluster center. For subsequent k-means processing, this is not 
>>> so important.
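The binary-search idea above can be sketched as follows. This treats "number of canopies produced by a given T2" as a non-increasing function of T2 and bisects toward a target count; `clusterCount` is a stand-in for actually running Canopy with that T2 (everything here is illustrative, not Mahout API):

```java
import java.util.function.DoubleToIntFunction;

public class T2Search {
  // Bisect on T2: larger T2 -> fewer canopies, so move lo up while the
  // count is still above the target, and hi down otherwise.
  static double findT2(DoubleToIntFunction clusterCount, int target,
                       double lo, double hi, int iters) {
    for (int i = 0; i < iters; i++) {
      double mid = (lo + hi) / 2;
      if (clusterCount.applyAsInt(mid) > target) {
        lo = mid;   // too many clusters: T2 too small, increase it
      } else {
        hi = mid;   // at or below target: try a smaller T2
      }
    }
    return (lo + hi) / 2;
  }

  public static void main(String[] args) {
    // Toy monotone model of canopy count vs. T2, for demonstration only.
    DoubleToIntFunction model = t2 -> (int) (1000 / (1 + t2));
    double t2 = findT2(model, 100, 0.0, 100.0, 60);
    int got = model.applyAsInt(t2);
    assert got >= 95 && got <= 105; // converged near the target count
  }
}
```

In practice each probe means one Canopy run over the data, so a handful of iterations over a sensible [lo, hi] range is usually enough.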
>>>
>>> Finally, you should not have had to modify the cluster dumper to 
>>> handle NamedVectors, as the AbstractCluster.formatVector(dictionary) 
>>> method it calls should handle them. I would expect to see the name 
>>> produced by AbstractCluster.formatVector(v, bindings) in the 
>>> <clusterId, <weight, vector>> tuples it outputs after the cluster 
>>> description. Can you verify that this is not the case? If so, can 
>>> you help further characterize it?
>>>
>>> I understand you are in the middle of your MSc. Good luck with that!
>>> Jeff
>>>
>>>
>>>
>

