mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: Clustering large files using hadoop?
Date Wed, 19 Sep 2012 06:28:28 GMT
KMeansDriver has a run method with a runSequential flag. When you set it to 
false, it will use the Hadoop cluster to scale. The kmeans command also has 
this flag.

"

In the process, I have been able to vectorize the data points and use the
clustering results of K-means to feed it as the initial centroid to Fuzzy
K-means clustering.

"
You can also use Canopy clustering for initial seeding, as it is a 
single-iteration clustering algorithm and produces good results if proper 
t1, t2 values are provided.
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
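
A similar sketch for the canopy seeding step (again with placeholder paths; 
the t1/t2 thresholds depend entirely on the distance scale of your data). The 
clusters directory it writes can then be passed to kmeans as the initial 
centroids via -c:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.clustering.canopy.CanopyDriver;

    public class CanopySeeding {
      public static void main(String[] unused) throws Exception {
        // Single pass over the data; t1/t2 below are placeholders, not recommendations.
        String[] args = {
            "-i",  "/user/you/vectors",
            "-o",  "/user/you/canopy-output",
            "-dm", "org.apache.mahout.common.distance.EuclideanDistanceMeasure",
            "-t1", "3.0",        // loose distance threshold
            "-t2", "1.5",        // tight distance threshold
            "-xm", "mapreduce"   // run on the Hadoop cluster
        };
        ToolRunner.run(new Configuration(), new CanopyDriver(), args);
        // Feed the clusters directory written under /user/you/canopy-output
        // (its exact name varies by Mahout version) to kmeans as its -c argument.
      }
    }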


On 19-09-2012 11:47, Rahul Mishra wrote:
> I have been able to cluster and generate results for small csv files (having
> only continuous values) on a local system using Eclipse, and it works
> smoothly.
> In the process, I have been able to vectorize the data points and use the
> clustering results of K-means to feed it as the initial centroid to Fuzzy
> K-means clustering.
>
> But, in the end, I am able to do it only for small files. For files having
> 2 million rows, it simply shows an out-of-memory error.
> But since Mahout is for large-scale machine learning, how do I convert my
> code to use the power of Hadoop's map-reduce framework? [info: I have
> access to a 3-node cluster running Hadoop]
> Can anyone suggest a step-by-step procedure?
>
> I have also looked into the clustering chapters of the book "Mahout in
> Action" but to my dismay did not find any clue.
>


