mahout-user mailing list archives

From Apurv Khare <Apurv.Kh...@lntinfotech.com>
Subject RE: How to Analyse K-mean Clustering output
Date Tue, 25 Jun 2013 03:53:55 GMT
Hey Ted,

This is my Java code:

package pkg;

import java.awt.Point;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.Kluster;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.utils.clustering.ClusterDumperWriter;
import org.apache.mahout.utils.vectors.csv.CSVVectorIterator;

@SuppressWarnings("unused")
public class KmeanCSV {
	public static List<Vector> points;

	@SuppressWarnings("deprecation")
	public static void main(String[] args) throws IOException,
			InterruptedException, ClassNotFoundException {

		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		// Path of my CSV in HDFS
		String input = "/user/hduser/inputdata/customerdatatest.csv";
		// Path to write the vectors in a sequence File in HDFS
		String output = "/user/hduser/inputdata/kmeans-vector/vector.seq";

		FSDataInputStream inputStream = fs.open(new Path(input));

		/*
		 * String line = null; while((line = inputStream.readLine()) != null) {
		 * System.out.println(line); }
		 */

		System.out.println("Done Reading");
		SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
				new Path(output), LongWritable.class, VectorWritable.class);

		String line1;
		long counter = 0;
		points = new ArrayList<Vector>();
		while ((line1 = inputStream.readLine()) != null) {
			String[] c = line1.split(",");
			double[] d = new double[c.length];
			for (int i = 0; i < c.length; i++)
				d[i] = Double.parseDouble(c[i]);
			Vector vec = new RandomAccessSparseVector(c.length);
			vec.assign(d);

			points.add(vec);
			VectorWritable writable = new VectorWritable();
			writable.set(vec);

			writer.append(new LongWritable(counter++), writable);
		}

		writer.close();
		inputStream.close();

		System.out.println("Done Writing");
		int k = 40;

		Path path = new Path(
				"/user/hduser/inputdata/kmeans-vector/clusters/part-00000");

		SequenceFile.Writer writer1 = new SequenceFile.Writer(fs, conf, path,
				Text.class, Kluster.class);

		// Use the first k input points as the initial cluster centres
		for (int i = 0; i < k; i++) {
			Vector vec = points.get(i);
			Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
			writer1.append(new Text(cluster.getIdentifier()), cluster);
		}
		writer1.close();

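		// Arguments after the distance measure: convergence delta, max
		// iterations, runClustering, classification threshold, runSequential
		KMeansDriver.run(conf,
				new Path("/user/hduser/inputdata/kmeans-vector/vector.seq"),
				new Path("/user/hduser/inputdata/kmeans-vector/clusters/part-00000"),
				new Path("/user/hduser/inputdata/output"),
				new EuclideanDistanceMeasure(), 0.001, 10, true, 0.0, false);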
		
		SequenceFile.Reader reader = new SequenceFile.Reader(fs,
				new Path("/user/hduser/inputdata/output/"
						+ Kluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
		IntWritable key = new IntWritable();
		WeightedVectorWritable value = new WeightedVectorWritable();
		FSDataOutputStream out = fs.create(
				new Path("/user/hduser/inputdata/output/cluster.txt"));
		while (reader.next(key, value)) {
			String assignment = value.toString() + " belongs to cluster "
					+ key.toString();
			// writeBytes emits one byte per character, so cluster.txt is
			// plain text with one assignment per line
			out.writeBytes(assignment + "\n");
			System.out.println(assignment);
		}
		reader.close();
		out.close();

	}
}
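
Since the sequential ID is left out of the distance calculation but is still needed to map cluster assignments back to the real rows, here is a minimal plain-Java sketch of that idea. It does not use Mahout: the class RowIdMapping and its methods id, features, nearest and assign are hypothetical names, the ID is assumed to be the first CSV column, and the centroids in main are made up for illustration. In Mahout itself the ID would have to travel with the vector in some way (for example as a NamedVector name), which this sketch does not attempt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowIdMapping {

    // First CSV column, assumed here to be the row ID.
    static String id(String line) {
        return line.split(",", 2)[0].trim();
    }

    // Every column after the ID, parsed as a numeric feature.
    static double[] features(String line) {
        String[] cols = line.split(",");
        double[] f = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            f[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return f;
    }

    // Index of the centroid closest to the point (squared Euclidean distance).
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Maps each row ID to the index of its nearest centroid, in input order.
    static Map<String, Integer> assign(String[] lines, double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            result.put(id(line), nearest(features(line), centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        // The two sample rows from the original question, ID first.
        String[] rows = {"1,1,19,3,1,20,1,3,1,2", "2,1,16,15,1,40,7,2,3,2"};
        // Made-up centroids for illustration: one per sample row.
        double[][] centroids = {features(rows[0]), features(rows[1])};
        System.out.println(assign(rows, centroids));
    }
}
```

The resulting map from ID to cluster index can then be joined back to the original customer records.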

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, June 24, 2013 7:11 PM
To: user@mahout.apache.org
Subject: Re: How to Analyse K-mean Clustering output

What code?


On Mon, Jun 24, 2013 at 8:00 AM, Apurv Khare <Apurv.Khare@lntinfotech.com> wrote:

> Hi,
>
> I am using clustering for one of my POC.
>
> My data looks like:
>
> Id | Gender | Education | Occupation | Income | Age | State | Marital Status | Children | Duration of Relationship
> 1  | 1      | 19        | 3          | 1      | 20  | 1     | 3              | 1        | 2
> 2  | 1      | 16        | 15         | 1      | 40  | 7     | 2              | 3        | 2
>
> But for the clustering I'm excluding the ID field: it is a sequential
> ID, so I thought it won't help in the distance calculation and may
> hinder the clustering process.
>
> In the following code I'm using k-means clustering, and the output of
> k-means is a bit confusing for me. How can I use it for my analysis?
>
> I found a utility for k-means called "cluster dump". How can I use it
> in my Java code?
>
> And then, how do I use the output of cluster dump to map the values
> back to my real data?
>
> It would really be a help, as we have to present a demo for our client.
>
> Thanks & Regards,
>
> Apurv Khare