mahout-user mailing list archives

From Rahul Mishra <mishra.rah...@gmail.com>
Subject Re: Clustering large files using hadoop?
Date Wed, 19 Sep 2012 10:36:56 GMT
For small files it works absolutely fine, but I get this error for large
files:
 Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded

Initially, I am reading the csv file with the following code, and I presume
the issue is here. Kindly suggest a better approach.

CSVReader reader = new CSVReader(new FileReader(inputPath));
int lineCount = 0;
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    lineCount++;
    // size the array from the row itself rather than a hard-coded 4,
    // so wider rows don't throw ArrayIndexOutOfBoundsException
    double[] d_attrib = new double[nextLine.length];
    for (int i = 0; i < nextLine.length; i++) {
        d_attrib[i] = Double.parseDouble(nextLine[i]);
    }
    // name the vector with the line number (msisdn)
    NamedVector vec = new NamedVector(
            new RandomAccessSparseVector(nextLine.length), " " + lineCount + " ");
    vec.assign(d_attrib);
    points.add(vec); // every vector stays in memory; this list grows with the file
}
reader.close();
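
A minimal sketch of a more memory-friendly variant, assuming Mahout's
VectorWritable and the Hadoop SequenceFile.Writer API (the output path below
is illustrative): each vector is appended to a SequenceFile on HDFS as it is
parsed, so nothing accumulates on the heap.

import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;
import au.com.bytecode.opencsv.CSVReader;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outPath = new Path("points/part-00000"); // illustrative HDFS path

CSVReader reader = new CSVReader(new FileReader(inputPath));
SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outPath, Text.class, VectorWritable.class);
try {
    int lineCount = 0;
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        lineCount++;
        double[] d_attrib = new double[nextLine.length];
        for (int i = 0; i < nextLine.length; i++) {
            d_attrib[i] = Double.parseDouble(nextLine[i]);
        }
        NamedVector vec = new NamedVector(
                new RandomAccessSparseVector(nextLine.length), " " + lineCount + " ");
        vec.assign(d_attrib);
        // write immediately instead of points.add(vec)
        writer.append(new Text(vec.getName()), new VectorWritable(vec));
    }
} finally {
    writer.close();
    reader.close();
}

The resulting SequenceFile can then be passed directly as the input path to
the Mahout clustering drivers.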




On Wed, Sep 19, 2012 at 3:03 PM, Lance Norskog <goksron@gmail.com> wrote:

> If your Hadoop configuration is in your environment variables, most Mahout
> jobs use the cluster by default. So, if you can run 'hadoop fs' and browse
> your HDFS cluster, Mahout should find your Hadoop cluster.
>
> Lance
>
> ----- Original Message -----
> | From: "Paritosh Ranjan" <pranjan@xebia.com>
> | To: user@mahout.apache.org
> | Sent: Tuesday, September 18, 2012 11:28:28 PM
> | Subject: Re: Clustering large files using hadoop?
> |
> | KMeansDriver has a run method with a runSequential flag. When you set
> | it to false, it will use the Hadoop cluster to scale. The kmeans
> | command-line job also has this flag.
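> |
> | A minimal sketch of that call, assuming the Mahout 0.5-era signature
> | used in "Mahout in Action" (the argument order varies across Mahout
> | versions, so check it against your release):
> |
> | import org.apache.hadoop.conf.Configuration;
> | import org.apache.hadoop.fs.Path;
> | import org.apache.mahout.clustering.kmeans.KMeansDriver;
> | import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> |
> | KMeansDriver.run(new Configuration(),
> |         new Path("points"),    // input: SequenceFile of VectorWritable
> |         new Path("clusters"),  // initial centroids (e.g. from canopy)
> |         new Path("output"),    // results directory on HDFS
> |         new EuclideanDistanceMeasure(),
> |         0.001,                 // convergenceDelta
> |         10,                    // maxIterations
> |         true,                  // runClustering: assign points at the end
> |         false);                // runSequential=false -> run on Hadoop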
> |
> | "
> |
> | In the process, I have been able to vectorize the data points  and
> | use the
> | clustering results of K-means to feed it as the initial centroid to
> | Fuzzy
> | K-means clustering.
> |
> | "
> | You can also use Canopy clustering for initial seeding, as it's a
> | single-iteration clustering algorithm and produces good results when
> | proper t1, t2 values are provided.
> | https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
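> |
> | A minimal sketch of canopy seeding, again assuming the Mahout 0.5-era
> | signature (the distance measure and the t1/t2 values are illustrative;
> | tune them for your data):
> |
> | import org.apache.mahout.clustering.canopy.CanopyDriver;
> | import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
> |
> | CanopyDriver.run(new Configuration(),
> |         new Path("points"),   // same vectors you feed to k-means
> |         new Path("canopies"), // canopy centroids land in canopies/clusters-0
> |         new EuclideanDistanceMeasure(),
> |         3.1,                  // t1: outer distance threshold
> |         2.1,                  // t2: inner threshold, t2 < t1
> |         false,                // runClustering: seeding only
> |         false);               // runSequential=false -> run on Hadoop
> |
> | The clusters-0 directory can then be passed to KMeansDriver.run as the
> | clustersIn argument.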
> |
> |
> | On 19-09-2012 11:47, Rahul Mishra wrote:
> | > I have been able to cluster and generate results for small csv files
> | > (having only continuous values) on a local system using Eclipse, and
> | > it works smoothly.
> | > In the process, I have been able to vectorize the data points and use
> | > the clustering results of K-means as the initial centroids for Fuzzy
> | > K-means clustering.
> | >
> | > But in the end I am able to do this only for small files. For files
> | > with 2 million rows, it simply fails with an out-of-memory error.
> | > Since Mahout is meant for large-scale machine learning, how do I
> | > convert my code to use the power of Hadoop's map-reduce framework?
> | > [info: I have access to a 3-node Hadoop cluster]
> | > Can anyone suggest a step-by-step procedure?
> | >
> | > I have also looked into the clustering chapters of the book "Mahout
> | > in Action" but to my dismay did not find any clue.
> | >
> |
> |
> |
>



-- 
Regards,
Rahul K Mishra,
www.ee.iitb.ac.in/student/~rahulkmishra
