mahout-user mailing list archives

From: Derek O'Callaghan <derek.ocallag...@ucd.ie>
Subject: Re: How to edit dataset for SVD recommendations with DistributedLanczos?
Date: Mon, 06 Dec 2010 17:34:45 GMT
Yeah, that should work. You can pass a different array to 
getSampleData() instead of DOCS, or change getSampleData() itself if 
you want to (i.e. change the body of the current "for (int i = 0; 
i < docs2.length; ..." loop). I think that should be all you need...
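
For example, here's a minimal sketch of the first option, reusing the setUp() from the code below; MY_DOCS is just a made-up name for whatever String[] holds your own dataset:

  // Hypothetical placeholder: your own documents go here instead of DOCS.
  private static final String[] MY_DOCS = {
      "text of your first document",
      "text of your second document"
  };

  @Override
  @Before
  public void setUp() throws Exception {
    super.setUp();
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Pass your own array to getSampleData() instead of the DOCS constant
    getSampleData(MY_DOCS);
    ClusteringTestUtils.writePointsToFile(sampleData, true,
        getTestTempFilePath("testdata/file1"), fs, conf);
  }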


On 06/12/10 17:20, Stefano Bellasio wrote:
> Thanks :) Found it! Well, I think the part that's useful for me is this one:
>
>   private List<VectorWritable> sampleData;
>
>    private String[] termDictionary;
>
>    @Override
>    @Before
>    public void setUp() throws Exception {
>      super.setUp();
>      Configuration conf = new Configuration();
>      FileSystem fs = FileSystem.get(conf);
>      // Create test data
>      getSampleData(DOCS);
>      ClusteringTestUtils.writePointsToFile(sampleData, true, getTestTempFilePath("testdata/file1"), fs, conf);
>    }
>
>    private void getSampleData(String[] docs2) throws IOException {
>      sampleData = new ArrayList<VectorWritable>();
>      RAMDirectory directory = new RAMDirectory();
>      IndexWriter writer = new IndexWriter(directory,
>                                           new StandardAnalyzer(Version.LUCENE_30),
>                                           true,
>                                           IndexWriter.MaxFieldLength.UNLIMITED);
>      for (int i = 0; i < docs2.length; i++) {
>        Document doc = new Document();
>        Fieldable id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
>        doc.add(id);
>        // Store both position and offset information
>        Fieldable text = new Field("content", docs2[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
>        doc.add(text);
>        writer.addDocument(doc);
>      }
>      writer.close();
>      IndexReader reader = IndexReader.open(directory, true);
>      Weight weight = new TFIDF();
>      TermInfo termInfo = new CachedTermInfo(reader, "content", 1, 100);
>
>      int numTerms = 0;
>      for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
>        it.next();
>        numTerms++;
>      }
>      termDictionary = new String[numTerms];
>      int i = 0;
>      for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
>        String term = it.next().term;
>        termDictionary[i] = term;
>        System.out.println(i + " " + term);
>        i++;
>      }
>      VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>      Iterable<Vector> iterable = new LuceneIterable(reader, "id", "content", mapper);
>
>      i = 0;
>      for (Vector vector : iterable) {
>        assertNotNull(vector);
>        NamedVector namedVector;
>        if (vector instanceof NamedVector) {
>          //rename it for testing purposes
>          namedVector = new NamedVector(((NamedVector) vector).getDelegate(), "P(" + i + ')');
>
>        } else {
>          namedVector = new NamedVector(vector, "P(" + i + ')');
>        }
>        System.out.println(AbstractCluster.formatVector(namedVector, termDictionary));
>        sampleData.add(new VectorWritable(namedVector));
>        i++;
>      }
>    }
>
>
> Can I pass my dataset to sampleData and then use something like public void testKmeansSVD()... am I right? Thanks
> On 06/Dec/2010, at 18:04, Derek O'Callaghan wrote:
>
>> Hi Stefano,
>>
>> The class can be found in mahout-utils/src/test/java.
>>
>> Derek
>>
>> On 06/12/10 16:54, Stefano Bellasio wrote:
>>      
>>> Hi Derek, thanks! I'm looking in my Mahout files and I can't find this class under org.apache.mahout.clustering.TestClusterDumper. Is it there, or in another package?
>>> On 06/Dec/2010, at 14:21, Derek O'Callaghan wrote:
>>>
>>>> Hi Stefano,
>>>>
>>>> TestClusterDumper has a few test methods which perform SVD with clustering, e.g. testKmeansSVD(). These methods demonstrate the creation of a matrix for use with SVD, so I think they might help to give you an overview of what's required.
>>>>
>>>> Regards,
>>>>
>>>> Derek
>>>>
>>>> On 04/12/10 18:04, Stefano Bellasio wrote:
>>>>
>>>>> I think I need to put all the data in a matrix, but how? I used Mahout's SVD from the command line with seqdirectory and seq2sparse, but without success :) Well, I think what I ultimately need is something like this: http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html#a1407801 but for recommendations. Can you help me with some suggestions or tutorials? I see that there is much interest in SVD and DistributedLanczos but really few suggestions and tutorials. Thank you again for your
