mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: Clustering large files using hadoop?
Date Wed, 19 Sep 2012 10:45:38 GMT
This code is putting everything into points (which I think is some sort of
in-memory collection). That will obviously throw an OOM for large files.
Instead, the vectors should be written to a sequence file, and the path to
that sequence file should then be given as input to the clustering algorithm.

Mahout in Action has a code snippet which does this. Googling "writing
into an HDFS sequence file" would also help.
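
For reference, a minimal sketch of that approach, reworking the quoted CSV
loop below so that each row goes straight into an HDFS sequence file as a
VectorWritable instead of being collected in memory. The class name and
paths are made up, and it assumes opencsv's CSVReader as in the quoted
snippet; treat it as an untested sketch rather than a drop-in solution.

import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

import au.com.bytecode.opencsv.CSVReader;

public class CsvToSequenceFile {
  public static void main(String[] args) throws Exception {
    String csvPath = args[0];                 // local csv file
    Path vectorsPath = new Path(args[1]);     // e.g. "points/points.seq" on HDFS (assumed path)

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    CSVReader reader = new CSVReader(new FileReader(csvPath));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, vectorsPath, Text.class, VectorWritable.class);
    try {
      VectorWritable vw = new VectorWritable();
      String[] nextLine;
      int lineCount = 0;
      while ((nextLine = reader.readNext()) != null) {
        lineCount++;
        double[] attrib = new double[nextLine.length];
        for (int i = 0; i < nextLine.length; i++) {
          attrib[i] = Double.parseDouble(nextLine[i]);
        }
        NamedVector vec = new NamedVector(
            new RandomAccessSparseVector(attrib.length),
            String.valueOf(lineCount));       // name the vector, e.g. with the msisdn
        vec.assign(attrib);
        vw.set(vec);
        // each vector is appended to HDFS immediately; nothing accumulates in a collection
        writer.append(new Text(vec.getName()), vw);
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}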

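Once the vectors are in that sequence file, its path is what you hand to the
driver. As noted further down in the thread, KMeansDriver.run has a
runSequential flag; setting it to false makes the iterations run as Hadoop
jobs. A rough sketch of the call, assuming the Mahout 0.5/0.6-era signature
used in the Mahout in Action examples (the parameter list has changed between
releases, so check the javadoc of your version); all paths and parameter
values here are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunKMeans {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up the cluster config on the classpath
    Path vectors = new Path("points/points.seq");   // the sequence file written above (assumed path)
    Path initialClusters = new Path("clusters");    // initial centroids, e.g. from RandomSeedGenerator or Canopy
    Path output = new Path("kmeans-output");

    // The last argument is runSequential: false means the iterations run as
    // MapReduce jobs on the Hadoop cluster instead of in the local JVM.
    KMeansDriver.run(conf, vectors, initialClusters, output,
        new EuclideanDistanceMeasure(),   // distance measure
        0.001,                            // convergence delta
        10,                               // max iterations
        true,                             // run the final clustering/assignment step
        false);                           // runSequential: false -> use the Hadoop cluster
  }
}
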
On 19-09-2012 16:06, Rahul Mishra wrote:
> For small files it works absolutely fine, but I get this error for large
> files:
>   Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Initially, I am reading the csv file using the following code, and I presume
> the issue is here. Kindly suggest a better approach.
>
> CSVReader reader = new CSVReader(new FileReader(inputPath));
> double field = -1;
> int lineCount = 0;
> String[] nextLine;
> while ((nextLine = reader.readNext()) != null) {
>     lineCount++;
>     //ArrayList<Double> attributes = new ArrayList<Double>();
>     double[] d_attrib = new double[4];
>     for (int i = 0; i < nextLine.length; i++) {
>         d_attrib[i] = Double.parseDouble(nextLine[i]);
>         // attributes.add(Double.parseDouble(nextLine[i]));
>     }
>     //Double[] d_attrib = attributes.toArray(new Double[attributes.size()]);
>     NamedVector vec = new NamedVector(
>         new RandomAccessSparseVector(nextLine.length),
>         " " + lineCount + " "); // name the vector with msisdn
>     vec.assign(d_attrib);
>     points.add(vec);
> }
>
>
>
>
> On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> If your Hadoop cluster configuration is in your environment variables, most
>> Mahout jobs use the cluster by default. So, if you can run 'hadoop fs' and
>> see your HDFS cluster, Mahout should find your Hadoop cluster as well.
>>
>> Lance
>>
>> ----- Original Message -----
>> | From: "Paritosh Ranjan" <pranjan@xebia.com>
>> | To: user@mahout.apache.org
>> | Sent: Tuesday, September 18, 2012 11:28:28 PM
>> | Subject: Re: Clustering large files using hadoop?
>> |
>> | KMeansDriver has a run method with a runSequential flag. When you set
>> | it to false, it will use the Hadoop cluster to scale. The kmeans
>> | command also has this flag.
>> |
>> | "
>> |
>> | In the process, I have been able to vectorize the data points  and
>> | use the
>> | clustering results of K-means to feed it as the initial centroid to
>> | Fuzzy
>> | K-means clustering.
>> |
>> | "
>> | You can also use Canopy clustering for the initial seeding, as it's a
>> | single-iteration clustering algorithm and produces good results if
>> | proper t1, t2 values are provided.
>> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
>> |
>> |
>> | On 19-09-2012 11:47, Rahul Mishra wrote:
>> | > I have been able to cluster and generate results for small csv files
>> | > (having only continuous values) on a local system using Eclipse, and
>> | > it works smoothly.
>> | > In the process, I have been able to vectorize the data points and use
>> | > the clustering results of K-means to feed it as the initial centroid
>> | > to Fuzzy K-means clustering.
>> | >
>> | > But in the end I am able to do it only for small files. For files
>> | > having 2 million rows, it simply shows an out-of-memory error.
>> | > But since Mahout is for large-scale machine learning, how do I convert
>> | > my code to use the power of Hadoop's map-reduce framework? [info: I
>> | > have access to a 3-node Hadoop cluster]
>> | > Can anyone suggest a step-by-step procedure?
>> | >
>> | > I have also looked into the clustering chapters of the book "Mahout
>> | > in Action" but to my dismay did not find any clue.
>> | >
>> |
>> |
>> |
>>
>
>
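
For the Canopy seeding suggested above, a sketch in the same spirit, again
assuming the 0.5/0.6-era CanopyDriver.run signature and made-up t1/t2 values
(check the javadoc of your release before relying on the argument order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunCanopySeeding {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("points/points.seq");   // same vector sequence file as above (assumed path)
    Path canopyOutput = new Path("canopy-output");

    // Single pass over the data; t1 > t2, and both are distance thresholds
    // that need tuning for your data (the values here are placeholders).
    CanopyDriver.run(conf, vectors, canopyOutput,
        new EuclideanDistanceMeasure(),
        3.0,     // t1
        1.5,     // t2
        false,   // skip the point-assignment step, only the canopy centers are needed
        false);  // runSequential: false -> run as a MapReduce job

    // The generated canopy centers (typically under canopy-output/clusters-0)
    // can then be passed to KMeansDriver.run as the initial clusters path.
  }
}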


