mahout-user mailing list archives

From Angelo Immediata <angelo...@gmail.com>
Subject Re: KMeans cluster analysis
Date Fri, 06 Dec 2013 08:12:16 GMT
Hi Ted

First of all, thank you for the support, and I'm sorry if I was not clear
enough about what my final intent was.

The final intent of my cluster analysis is to find the clusters to which the
records belong. For the cluster analysis I need to take into account the
following variables: meteo, manifestation, day of the week, month of the
year, hour of the day, vacation.

For example, I'd like to get a result like this: the record with ID X
belongs to cluster C1, the record with ID Y belongs to cluster C2, and so
on, where cluster C1 is a cluster whose objects all share the following
characteristics: hour of the day=8am, manifestation=1, meteo=5, month of the
year=4, vacation=0, and cluster C2 is a cluster whose objects all share:
hour of the day=8am, manifestation=2, meteo=2, month of the year=5,
vacation=2.
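
(To make the expected result concrete: after a k-means run, the
record-to-cluster assignment could be read back with something like the
sketch below. It only assumes the run wrote its assignments as a
SequenceFile under an output directory; the path is a placeholder and the
key/value classes are taken from the file itself.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReadClusterAssignments {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path: with "mahout kmeans ... -cl" (or runClustering = true
        // through the API) the assignments usually land under <output>/clusteredPoints
        Path assignments = new Path("kmeans-output/clusteredPoints/part-m-00000");

        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), assignments, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // key is the cluster id, value is the (weighted) vector assigned to it
            System.out.println("cluster " + key + " <- " + value);
        }
        reader.close();
    }
}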

For me (surely because I'm a real newbie to these topics...) what is
difficult to understand is how to correctly generate the input data for the
cluster analysis from the variables meteo, manifestation, day of the week,
month of the year, hour of the day and vacation.
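
(A minimal sketch of that step, assuming the usual Mahout k-means input
format of a SequenceFile with Text keys and VectorWritable values; the
output path and the field values below are placeholders taken from the raw
representation quoted further down.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteInputVectors {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("kmeans-input/part-00000");  // placeholder path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, input, Text.class, VectorWritable.class);
        try {
            // One coordinate per clustering variable:
            // meteo, manifestation, weekDay, dayHour, vacation
            Vector record = new NamedVector(new DenseVector(new double[] {1, 3, 4, 10, 3}), "record-1");
            writer.append(new Text("record-1"), new VectorWritable(record));
        } finally {
            writer.close();
        }
    }
}

(Since meteo, manifestation and the calendar fields are categorical codes,
treating them as plain numeric coordinates under a Euclidean distance is a
simplification; that is really part of the "how are two records similar"
question raised in this thread.)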

Thanks again

Angelo



2013/12/6 Ted Dunning <ted.dunning@gmail.com>

> Angelo,
>
> The first question is how you intend to define which items are similar.
>
> Also, what is the intended use of the clustering?  Without knowing that, it
> is very hard to say how to best do the clustering.
>
> For instance, are two records more similar if the records are at the same
> time of day?  Or do you really want to cluster arcs by getting all of the
> records for a single arc and finding other arcs which have similar
> characteristics in different weather conditions and time of day?
>
> Without some more idea about what is going on, it will not be possible for
> you to succeed with clustering, nor for us to help you.
>
>
>
> On Thu, Dec 5, 2013 at 3:38 AM, Angelo Immediata <angeloimm@gmail.com> wrote:
>
> > Hi
> >
> > First of all, I'm sorry if I repeat this question... but it's a pretty old
> > one and I really need some help, since I'm a real newbie to Mahout and
> > Hadoop.
> >
> > I need to do some cluster analysis on some data. At the beginning this
> > data may not be very large, but after some time it can become really huge
> > (I did some calculations and after 1 year this data can be around 37
> > billion records). Since I have this huge amount of data, I decided to do
> > the cluster analysis by using Mahout on top of Apache Hadoop and its HDFS.
> > Regarding where to store this big amount of data, I decided to use Apache
> > HBase, again on top of Apache Hadoop HDFS.
> >
> > Now I need to do this cluster analysis by considering some environment
> > variables. These variables may be the following:
> >
> >    - *recordId* = id of the record
> >    - *arcId* = id of the arc between 2 points of my "street graph"
> >    - *mediumVelocity* = average (medium) velocity on the considered arc in
> >    the specified period
> >    - *vehiclesNumber* = number of monitored vehicles used to compute that
> >    velocity
> >    - *meteo* = weather condition (a numeric code representing whether there
> >    is sun, rain, etc.)
> >    - *manifestation* = a numeric code representing whether there is any
> >    kind of event (a sporting event or other)
> >    - *day of the week*
> >    - *month of the year*
> >    - *hour of the day*
> >    - *vacation* = a numeric code representing whether it is a vacation day
> >    or a working day
> >
> > So my data is formatted as follows (raw representation):
> >
> > *recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation*
> > 1          1      34.5            20              1      3              4        2011       10       3
> > 2          156    66.5            3               2      5              1        2008       6        2
> >
> > The clustering should be done by taking into account at least these
> > variables: meteo, manifestation, weekDay, dayHour, vacation.
> >
> > Now, in order to take data from HBase, I used the MapReduce functionalities
> > provided by HBase; basically I wrote this code:
> >
> > My Mapper and Reducer classes:
> >
> > package hadoop.mapred;
> >
> > import hadoop.hbase.model.HistoricalDataModel;
> >
> > import java.io.IOException;
> >
> > import org.apache.commons.logging.Log;
> > import org.apache.commons.logging.LogFactory;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > import org.apache.hadoop.hbase.util.Bytes;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.io.Writable;
> > import org.apache.hadoop.mapred.join.TupleWritable;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class HistoricalDataMapRed {
> >
> >     public static class HistoricalDataMapper extends TableMapper<Text, TupleWritable> {
> >
> >         private static final Log logger = LogFactory.getLog(HistoricalDataMapper.class.getName());
> >
> >         private int numRecords = 0;
> >
> >         @Override
> >         protected void map(ImmutableBytesWritable key, Result result, Context context)
> >                 throws IOException, InterruptedException {
> >             try {
> >                 // Read the clustering variables from the HBase row
> >                 Writable[] vals = new Writable[4];
> >                 IntWritable calFest = new IntWritable(Bytes.toInt(result.getValue(
> >                         HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_FESTIVO)));
> >                 vals[0] = calFest;
> >                 IntWritable calEven = new IntWritable(Bytes.toInt(result.getValue(
> >                         HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_EVENTI)));
> >                 vals[1] = calEven;
> >                 IntWritable meteo = new IntWritable(Bytes.toInt(result.getValue(
> >                         HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.EVENTO_METEO)));
> >                 vals[2] = meteo;
> >                 IntWritable manifestazione = new IntWritable(Bytes.toInt(result.getValue(
> >                         HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.MANIFESTAZIONE)));
> >                 vals[3] = manifestazione;
> >
> >                 // Use the HBase row key as the output key
> >                 String chiave = Bytes.toString(result.getRow());
> >                 context.write(new Text(chiave), new TupleWritable(vals));
> >
> >                 numRecords++;
> >                 if ((numRecords % 10000) == 0) {
> >                     context.setStatus("mapper processed " + numRecords + " records so far");
> >                 }
> >             } catch (Exception e) {
> >                 String message = "Error in the mapper; error message: " + e.getMessage();
> >                 logger.fatal(message, e);
> >                 throw new IOException(message);
> >             }
> >         }
> >     }
> >
> >     // A plain Reducer (not a TableReducer): the output goes to a SequenceFile
> >     // on HDFS, not back into an HBase table
> >     public static class HistoricalDataReducer extends Reducer<Text, TupleWritable, Text, TupleWritable> {
> >
> >         private static final Log logger = LogFactory.getLog(HistoricalDataReducer.class.getName());
> >
> >         @Override
> >         protected void reduce(Text key, Iterable<TupleWritable> values, Context context)
> >                 throws IOException, InterruptedException {
> >             try {
> >                 // Write each tuple out under its row key
> >                 for (TupleWritable value : values) {
> >                     context.write(key, value);
> >                 }
> >             } catch (Exception e) {
> >                 String message = "Error in the reducer; error message: " + e.getMessage();
> >                 logger.fatal(message, e);
> >                 throw new IOException(message);
> >             }
> >         }
> >     }
> > }
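
(One possible adjustment, not part of the code above: Mahout's k-means
expects its input SequenceFile to contain Text/VectorWritable pairs, so the
mapper could emit a Mahout vector directly instead of a TupleWritable. A
rough, untested sketch reusing the same HistoricalDataModel columns:)

import hadoop.hbase.model.HistoricalDataModel;

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class HistoricalDataVectorMapper extends TableMapper<Text, VectorWritable> {

    @Override
    protected void map(ImmutableBytesWritable key, Result result, Context context)
            throws IOException, InterruptedException {
        // One coordinate per clustering variable, read from the same columns as above
        double[] features = new double[] {
                Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.EVENTO_METEO)),
                Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.MANIFESTAZIONE)),
                Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_FESTIVO)),
                Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_EVENTI))
        };
        String rowKey = Bytes.toString(result.getRow());
        // NamedVector keeps the record id attached to the vector
        NamedVector vector = new NamedVector(new DenseVector(features), rowKey);
        context.write(new Text(rowKey), new VectorWritable(vector));
    }
}

(With a mapper like this, the job could run with zero reduce tasks and
SequenceFileOutputFormat, and its output directory could then be fed
straight to k-means.)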
> >
> > And then I wrote:
> >
> > try {
> >     Configuration conf = HBaseConfiguration.create();
> >     Job job = new Job(conf, "HBase_historicaldataJob");
> >     job.setJarByClass(HistoricalDataMapper.class);
> >
> >     Scan scan = new Scan();
> >     scan.addFamily(HistoricalDataModel.HISTORICAL_DATA_FAMILY);
> >     scan.setCaching(500);
> >     scan.setCacheBlocks(false);
> >
> >     // The output key/value classes here must match what HistoricalDataMapper emits
> >     TableMapReduceUtil.initTableMapperJob(
> >             ClusteringHistoricalDataDao.HBASE_TABLE_NAME,
> >             scan,
> >             HistoricalDataMapper.class,
> >             Text.class,
> >             TupleWritable.class,
> >             job);
> >
> >     job.setReducerClass(HistoricalDataReducer.class);
> >     job.setNumReduceTasks(2);
> >
> >     // Write the job output as a SequenceFile on HDFS
> >     job.setOutputKeyClass(Text.class);
> >     job.setOutputValueClass(TupleWritable.class);
> >     job.setOutputFormatClass(SequenceFileOutputFormat.class);
> >
> >     Path path = new Path("/tmp/mr/mySummaryFile");
> >     HadoopUtil.delete(conf, path);
> >     FileOutputFormat.setOutputPath(job, path);  // adjust directories as required
> >
> >     boolean b = job.waitForCompletion(true);
> > } catch (Exception e) {
> >     logger.fatal("Error ", e);
> >     throw new IllegalStateException(e);
> > }
> >
> > By doing it this way I can generate my SequenceFile with the input data;
> > now I should use it to do the cluster analysis, and here is the problem:
> > how can I use the generated file to make a cluster analysis that takes
> > into account the variables listed previously?
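
(A rough sketch of what that clustering step could look like, assuming a
Mahout 0.7-style API. The KMeansDriver.run signature changes between Mahout
versions, in newer releases the DistanceMeasure argument is dropped because
the measure is read from the initial clusters, and all paths, k and
iteration counts are placeholders. It also assumes the input SequenceFile
contains Text/VectorWritable pairs, as in the mapper variant sketched
above.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunKMeans {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path vectors = new Path("kmeans-input");          // SequenceFile of Text/VectorWritable
        Path initialClusters = new Path("kmeans-seeds");  // where the seed centroids are written
        Path output = new Path("kmeans-output");

        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();

        // Pick k random input vectors as the initial centroids
        Path seeds = RandomSeedGenerator.buildRandom(conf, vectors, initialClusters, 5, measure);

        // Iterate k-means; with runClustering = true the point-to-cluster
        // assignments are written under kmeans-output/clusteredPoints
        KMeansDriver.run(conf, vectors, seeds, output, measure,
                0.001,   // convergence delta
                10,      // max iterations
                true,    // runClustering: also emit the assignments
                0.0,     // clusterClassificationThreshold
                false);  // runSequential: false = run as MapReduce jobs
    }
}

(The clusteredPoints output is what the first sketch in this message reads
back to get the record-to-cluster mapping.)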
> >
> > Moreover, is this a good approach for the cluster analysis?
> >
> > I searched around, but I was not able to find any good sample. Any
> > suggestion would be really appreciated.
> >
> > Thank you,
> >
> > Angelo
> >
>
