mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Failure to run Clustering example
Date Fri, 01 May 2009 14:18:51 GMT
Hi Shashi,

Until we have element labels on our Vectors 
(http://issues.apache.org/jira/browse/MAHOUT-65) you will have to keep a 
separate map or list of the ID to Vector index associations and pass it 
to your mappers/reducers in a configuration file. You could use Gson, 
which is already in Mahout/lib, to encode this information in the file 
system. Other than that, your plan should yield a set of sparse document 
vectors which you can then cluster using one of the clustering jobs.
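
As a rough illustration of the plan quoted below, here is a minimal sketch in plain Java. It does not use the Lucene or Mahout APIs: the class name TermVectorSketch and its methods are hypothetical, documents are assumed to be already tokenized into lists of terms, and the weight is a simplified tf * ln(N/df) rather than Lucene's own scoring. The term-to-ID map it builds is the association that would have to be saved separately (for example as JSON) and handed to the mappers/reducers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of Mahout or Lucene. */
class TermVectorSketch {

    /** Assign an integer ID (the vector index) to each distinct term, in first-seen order. */
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                ids.putIfAbsent(term, ids.size());
            }
        }
        return ids;
    }

    /** Count, for each term, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Build a sparse vector (term ID -> weight) for one document.
     * Weight is the assumed simplification tf * ln(numDocs / df);
     * zero weights are not stored, which keeps the vector sparse.
     */
    static Map<Integer, Double> tfIdfVector(List<String> doc,
                                            Map<String, Integer> ids,
                                            Map<String, Integer> df,
                                            int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            double weight = e.getValue() * idf;
            if (weight != 0.0) {
                vector.put(ids.get(e.getKey()), weight);
            }
        }
        return vector;
    }
}
```

The sorted map of index to weight stands in for a sparse vector here; converting it to an actual Mahout vector, and reading terms and frequencies from a real Lucene index, are left out because they depend on the library versions in use.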

I'd be interested in how the various algorithms perform. Would you 
consider submitting the index generation code to Mahout? I'm sure many 
users would find it useful.

Jeff

Shashikant Kore wrote:
> Here is my plan to create the document vectors.
>
> 1. Create Lucene index for all the text files.
> 2. Iterate on the terms in the index and assign an ID to each term.
> 3. For each text file
>    3a. Get terms of the file.
>    3b. Get the TF-IDF score of each term from the Lucene index. In the
> document vector, store this score along with the term ID. The document
> vector will be a sparse vector.
>
> Can this now be given as input to the clustering code?
>
> Thanks,
> --shashi
>
> On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>   
>> On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
>>
>>     
>>> Hi Jeff,
>>>
>>> The JDK problem occurs while running the example of Synthetic Control Data
>>> from
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>>>
>>>
>>> The other query was about how to convert text files to
>>> Mahout Vectors. Let's say I have text files of Wikipedia pages and now
>>> I want to create clusters out of them. How do I get the Mahout vector
>>> from the Lucene index? Can you point me to some theory behind it,
>>> which I can then convert to code?
>>>       
>> I don't think we have any demo code for this yet.  I have a personal task
>> that I'm trying to get to that will demonstrate how to cluster text starting
>> from a plain text file, but nothing in code yet, especially not anything
>> that takes it from Lucene.  All of these would be great additions to have.
>> I think Richard Tomsett said he had some code to do it, but hasn't donated
>> it yet.  He's also put up a patch for a cosine distance metric, but it
>> is not committed yet.
>>
>> Cheers,
>> Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>     
>
>
>   

