mahout-user mailing list archives

From: Grant Ingersoll
Subject: Re: Clustering from DB
Date: Fri, 26 Jun 2009 15:41:46 GMT

On Jun 26, 2009, at 10:20 AM, nfantone wrote:

> Hi to you all, Mahout users. I'm new to the list and to Mahout itself
> and I'm trying to integrate Taste into my project, in which I need to
> cluster user data from a very large data set, based on user behavior
> that is stored in some tables in a local database. From what I've
> read and experimented, clustering in Mahout takes advantage of HDFS
> and Lucene indexing, converting plain CSV files to Vectors. So, I ask:
> is it mandatory to create plain text files (or HDFS files) and indexes
> from the data in my DB so as to feed clustering algorithm's input?
> Couldn't I create, somehow, the Vectors directly and then use them to
> initiate the clustering jobs? Is there any convenient way to achieve
> this? I've not seen anything similar to the "DataModel" interface used
> by Recommenders for JDBC connection (or any other connectivity API)
> and the runJob static methods receive paths for both input and output
> which, a priori, I don't have any use for. Documentation wasn't
> helpful either as the "From a Database" section of "Creating Vectors
> from Text" is currently empty.

The clustering algorithms (on trunk) expect the input file to be a
Hadoop SequenceFile of <Writable, Vector>.

The utils module contains an interface named VectorIterable, which
could easily be implemented on top of a JDBC connection.  There is
already an implementation of it for Lucene (LuceneIterable).  However,
it is likely just as easy to write your own ResultSet loop that reads
rows from your DB and writes them to the SequenceFile.  There are
SequenceFile.Writer examples in several places in the utils module;
see the Driver class there, for example.
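
A minimal sketch of such a ResultSet loop, using only the JDK.  It
assumes the query returns a numeric id in column 1 and numeric features
in the remaining columns.  The VectorSink interface and the class and
method names are hypothetical, made up for illustration; in a real job
the sink would wrap a SequenceFile.Writer and the double[] would be
turned into a Mahout Vector.

```java
import java.sql.ResultSet;

public class JdbcVectorExport {

    /**
     * Hypothetical output abstraction (not a Mahout or Hadoop type).
     * A real implementation would append a <Writable, Vector> pair
     * to a SequenceFile.Writer here.
     */
    public interface VectorSink {
        void append(long key, double[] vector) throws Exception;
    }

    /** Turns one row's feature values into a dense vector; SQL NULLs become 0.0. */
    public static double[] toDenseVector(Number[] features) {
        double[] vector = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            vector[i] = features[i] == null ? 0.0 : features[i].doubleValue();
        }
        return vector;
    }

    /**
     * Reads every row of the result set and hands (id, features) to the sink.
     * Assumes column 1 is the row id and columns 2..n are numeric features.
     * Returns the number of rows exported.
     */
    public static int export(ResultSet rs, VectorSink sink) throws Exception {
        int columns = rs.getMetaData().getColumnCount();
        int rows = 0;
        while (rs.next()) {
            long key = rs.getLong(1);
            Number[] features = new Number[columns - 1];
            for (int c = 2; c <= columns; c++) {
                double value = rs.getDouble(c);
                // getDouble returns 0 for SQL NULL, so check wasNull explicitly.
                features[c - 2] = rs.wasNull() ? null : Double.valueOf(value);
            }
            sink.append(key, toDenseVector(features));
            rows++;
        }
        return rows;
    }
}
```

Keeping the sink behind a small interface means the JDBC loop can be
tried out against an in-memory list before it is pointed at HDFS.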

Also, FYI, Taste is separate from what you seem to be implying you
want to do.  Taste is a collaborative filtering engine that lives in
Mahout.  Mahout also has several clustering implementations, such as
k-Means, Canopy, and Dirichlet.
