mahout-user mailing list archives

From Stefano Bellasio <stefanobella...@gmail.com>
Subject Re: How to edit dataset for SVD recommendations with DistributedLanczos?
Date Mon, 06 Dec 2010 17:20:45 GMT
Thanks :) Found it! I think the part that is useful for me is this one:

  private List<VectorWritable> sampleData;

  private String[] termDictionary;

  @Override
  @Before
  public void setUp() throws Exception {
    super.setUp();
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Create test data
    getSampleData(DOCS);
    ClusteringTestUtils.writePointsToFile(sampleData, true, getTestTempFilePath("testdata/file1"),
                                          fs, conf);
  }

  private void getSampleData(String[] docs2) throws IOException {
    sampleData = new ArrayList<VectorWritable>();
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
                                         new StandardAnalyzer(Version.LUCENE_30),
                                         true,
                                         IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < docs2.length; i++) {
      Document doc = new Document();
      Fieldable id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      // Store both position and offset information
      Fieldable text = new Field("content", docs2[i], Field.Store.NO, Field.Index.ANALYZED,
                                 Field.TermVector.YES);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    IndexReader reader = IndexReader.open(directory, true);
    Weight weight = new TFIDF();
    TermInfo termInfo = new CachedTermInfo(reader, "content", 1, 100);

    int numTerms = 0;
    for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
      it.next();
      numTerms++;
    }
    termDictionary = new String[numTerms];
    int i = 0;
    for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
      String term = it.next().term;
      termDictionary[i] = term;
      System.out.println(i + " " + term);
      i++;
    }
    VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
    Iterable<Vector> iterable = new LuceneIterable(reader, "id", "content", mapper);

    i = 0;
    for (Vector vector : iterable) {
      assertNotNull(vector);
      NamedVector namedVector;
      if (vector instanceof NamedVector) {
        //rename it for testing purposes
        namedVector = new NamedVector(((NamedVector) vector).getDelegate(), "P(" + i + ')');

      } else {
        namedVector = new NamedVector(vector, "P(" + i + ')');
      }
      System.out.println(AbstractCluster.formatVector(namedVector, termDictionary));
      sampleData.add(new VectorWritable(namedVector));
      i++;
    }
  }


Can I load my own dataset into sampleData and then follow something like public void testKmeansSVD()?
Is that right? Thanks
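
To make the question concrete, here is a minimal sketch of how I imagine my ratings could be grouped into matrix rows before they go into sampleData. It is plain Java with no Mahout dependency; the class name RatingsMatrixSketch, the method toSparseRows and the three parallel input arrays are hypothetical and only for illustration. My assumption is that each row would then be copied into a Mahout sparse vector and wrapped in a VectorWritable, the way the test above wraps its NamedVector instances.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Mahout): groups ratings into one sparse row per user. */
final class RatingsMatrixSketch {

  private RatingsMatrixSketch() {}

  /**
   * Builds the rows of a user-by-item matrix from parallel arrays of ratings.
   * Entry k means: users[k] gave items[k] the rating ratings[k].
   * Returns userId -> (itemId -> rating); items a user never rated are simply absent.
   */
  static Map<Integer, Map<Integer, Double>> toSparseRows(int[] users, int[] items, double[] ratings) {
    if (users.length != items.length || users.length != ratings.length) {
      throw new IllegalArgumentException("users, items and ratings must have the same length");
    }
    // TreeMap keeps users and items in ascending id order, so iteration is deterministic.
    Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();
    for (int k = 0; k < users.length; k++) {
      // A later rating for the same (user, item) pair overwrites the earlier one.
      rows.computeIfAbsent(users[k], u -> new TreeMap<>()).put(items[k], ratings[k]);
    }
    return rows;
  }

  public static void main(String[] args) {
    int[] users = {0, 0, 1, 2};
    int[] items = {0, 2, 1, 2};
    double[] ratings = {5.0, 3.0, 4.0, 1.0};

    // Assumption: in Mahout, each of these rows would become one vector added to sampleData.
    for (Map.Entry<Integer, Map<Integer, Double>> row : toSparseRows(users, items, ratings).entrySet()) {
      System.out.println("user " + row.getKey() + " -> " + row.getValue());
    }
  }
}
```

Each user becomes one row and each item one column, which is the shape I understand the SVD step expects as input.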
On 06 Dec 2010, at 18:04, Derek O'Callaghan wrote:

> Hi Stefano,
> 
> The class can be found in mahout-utils/src/test/java.
> 
> Derek
> 
> On 06/12/10 16:54, Stefano Bellasio wrote:
>> Hi Derek, thanks! I'm looking through my Mahout files, and I can't find this class under
>> org.apache.mahout.clustering.TestClusterDumper. Is it there, or in another package?
>> On 06 Dec 2010, at 14:21, Derek O'Callaghan wrote:
>> 
>>   
>>> Hi Stefano,
>>> 
>>> TestClusterDumper has a few test methods which perform SVD with clustering, e.g.
>>> testKmeansSVD(). These methods demonstrate the creation of a matrix for use with SVD, so I
>>> think they might help to give you an overview of what's required.
>>> 
>>> Regards,
>>> 
>>> Derek
>>> 
>>> On 04/12/10 18:04, Stefano Bellasio wrote:
>>>     
>>>> I think I need to put all the data in a matrix, but how? I used Mahout's SVD command
>>>> line with seqdirectory and seq2sparse, but without success :) I think I ultimately need
>>>> something like this: http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html#a1407801
>>>> but for recommendations. Can you help me with some suggestions or tutorials? I see that there
>>>> is a lot of interest in SVD and DistributedLanczos but very few suggestions and tutorials. Thank
>>>> you again for your
>>>> 
>>>>       
>> 
>>   

