mahout-user mailing list archives

From David Swift <dsw...@pccowboy.com>
Subject Re: Question regarding clustering vector files
Date Thu, 04 Oct 2012 02:03:45 GMT
Jeff,

Thank you very much for the tip; my clustering job is now running. Yahoo!

Thanks again,
David

----- Original Message -----
From: "Jeff Eastman" <jdog@windwardsolutions.com>
To: user@mahout.apache.org
Sent: Wednesday, October 3, 2012 6:25:02 PM
Subject: Re: Question regarding clustering vector files

Hard to tell without seeing the stack dump that raised the exception, but 
consider: in order to create the ModelDistribution, a Vector prototype 
is created with the size of the first data record read. This then 
configures the size of the corresponding Models created by the 
distribution. If any of your input vectors are larger than this 
prototype, that could cause the index exception you are seeing. 
I suggest you create your sparse vectors with Integer.MAX_VALUE size to 
work around this. They won't take up any more space, and the algorithm 
will be more forgiving.
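A minimal sketch of that suggestion (Integer.MAX_VALUE is the actual Java constant; this assumes mahout-math is on the classpath): only the entries you set consume memory in a sparse vector, so every record can present the same maximal cardinality to the prototype regardless of its own feature count.

```java
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVectorSizing {
    public static void main(String[] args) {
        // Size the sparse vector with Integer.MAX_VALUE rather than the
        // per-record feature count; storage grows only with the entries
        // actually set, not with the declared cardinality.
        Vector v = new SequentialAccessSparseVector(Integer.MAX_VALUE);
        v.setQuick(204, 1.0); // an index that would overflow a size-100 vector
        System.out.println(v.get(204));
        System.out.println(v.getNumNondefaultElements());
    }
}
```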


On 10/3/12 1:04 PM, David Swift wrote:
> I am attempting to use the dirichlet clusterer, and I am getting an error like:
> org.apache.mahout.math.IndexException: Index 204 is outside allowable range of [0,99)
>          at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:172)
>
> I could use an extra pair of eyes on my process; I'd be very glad for any pointers in the right direction.
>
> I am executing the clusterer with:
>   bin/mahout dirichlet --input 2_vect.out --output 2_cluster --maxIter 100 --numClusters 5
>
>
> I have prepared a vector file from a directory of files using the following code, modeled on the LastfmDataConverter and a SequenceFileWriteDemo (I've snipped the obvious parts, like the imports and the 'main' method):
>
>    public static Map<String, List<Integer>> convertToItemFeatures(String inputFile,
>        Map<String, List<Integer>> itemFeatures, Map<String, Integer> featureIdxM) throws IOException {
>      BufferedReader br = Files.newReader(new File(inputFile), Charsets.UTF_8);
>      try {
>        String line;
>        System.out.print("Reading " + inputFile + "\n");
>        while ((line = br.readLine()) != null) {
>          // get the featureIdx
>          Integer featureIdx = featureIdxM.get(line);
>          if (featureIdx == null) {
>            featureIdx = featureIdxM.size() + 1;
>            featureIdxM.put(line, featureIdx);
>          }
>          // add it to the corresponding feature idx map
>          List<Integer> features = itemFeatures.get(inputFile);
>          if (features == null) {
>            features = Lists.newArrayList();
>            itemFeatures.put(inputFile, features);
>          }
>          features.add(featureIdx);
>        }
>      } finally {
>        Closeables.closeQuietly(br);
>      }
>      return itemFeatures;
>    }
>
>    /**
>     * Converts each record in (item,features) map into Mahout vector format and
>     * writes it into sequencefile for minhash clustering
>     */
>    public static boolean writeToSequenceFile(Map<String, List<Integer>> itemFeaturesMap,
>        Path outputPath) throws IOException {
>      Configuration conf = new Configuration();
>      FileSystem fs = FileSystem.get(conf);
>      fs.mkdirs(outputPath.getParent());
>      long totalRecords = itemFeaturesMap.size();
>      SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath,
>          Text.class, VectorWritable.class);
>      try {
>        String msg = "Now writing vectorized data in sequence file format: ";
>        System.out.print(msg);
>
>        Text itemWritable = new Text();
>        VectorWritable featuresWritable = new VectorWritable();
>
>        for (Map.Entry<String, List<Integer>> itemFeature : itemFeaturesMap.entrySet()) {
>          int numfeatures = itemFeature.getValue().size();
>          itemWritable.set(itemFeature.getKey());
>          Vector featureVector = new SequentialAccessSparseVector(numfeatures);
>          int i = 0;
>          for (Integer feature : itemFeature.getValue()) {
>            featureVector.setQuick(i++, feature);
>          }
>          featuresWritable.set(featureVector);
>          writer.append(itemWritable, featuresWritable);
>        }
>      } finally {
>        Closeables.closeQuietly(writer);
>      }
>      return true;
>    }
>
>    public static Map<String, List<Integer>> listFilesForFolder(final File folder) {
>      Map<String, Integer> featureIdxMap = Maps.newHashMap();
>      Map<String, List<Integer>> itemFeaturesMap = Maps.newHashMap();
>
>      File[] listOfFiles = folder.listFiles();
>      for (int i = 0; i < listOfFiles.length; i++) {
>          if (listOfFiles[i].isFile()) {
>             try {
>                 String files = listOfFiles[i].getCanonicalPath();
>                 System.out.println(files);
>                 convertToItemFeatures(files, itemFeaturesMap, featureIdxMap);
>                 System.out.print("Size of features == " + featureIdxMap.size() + "\n");
>                 System.out.print("Size of itemFeatures == " + itemFeaturesMap.size() + "\n");
>             }
>             catch (Exception e) {
>                 System.out.print("Ugh " + e.getMessage());
>             }
>          }
>      }
>
>      return itemFeaturesMap;
>    }
>
> Any ideas why my vector file cannot be read?  Do I still need to run seq2sparse on it?

