mahout-user mailing list archives

From Angelo Immediata <angelo...@gmail.com>
Subject Re: Write SequenceFile from custom data
Date Wed, 04 Dec 2013 14:51:03 GMT
I was thinking of using org.apache.hadoop.mapred.join.TupleWritable to
implement my clustering. In your opinion, is this the right choice?
If not, how could I implement my scenario?

Thank you
Angelo
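
For reference, here is a minimal sketch of one way the similarity question
raised in the quoted thread below could be approached: turn the five variables
(meteo, manifestation, day of the week, month of the year, vacation) into a
plain numeric vector, so that an ordinary distance measure can be applied.
Everything here is an assumption made for illustration and not something stated
in the thread: the category counts, the 0-based codes, the one-hot encoding of
the categorical codes, and the sin/cos encoding of the periodic fields. It is
written in Python only to keep it self-contained.

```python
import math

# Hypothetical category counts; the thread does not say how many codes exist.
N_METEO = 4
N_MANIFESTATION = 6
N_VACATION = 4


def one_hot(code, size):
    """Return a list of `size` floats with 1.0 at position `code`."""
    if not 0 <= code < size:
        raise ValueError(f"code {code} outside range(0, {size})")
    return [1.0 if i == code else 0.0 for i in range(size)]


def cyclic(value, period):
    """Map a periodic value onto the unit circle as [sin, cos]."""
    angle = 2.0 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]


def encode(meteo, manifestation, week_day, month, vacation):
    """Build one feature vector from the five similarity variables.

    Assumes 0-based integer codes, week_day in 0..6 and month in 0..11.
    """
    return (one_hot(meteo, N_METEO)
            + one_hot(manifestation, N_MANIFESTATION)
            + cyclic(week_day, 7)
            + cyclic(month, 12)
            + one_hot(vacation, N_VACATION))


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Illustrative values only, not taken from real data.
    a = encode(meteo=1, manifestation=3, week_day=4, month=9, vacation=3)
    b = encode(meteo=2, manifestation=5, week_day=1, month=5, vacation=2)
    print(len(a), euclidean(a, b))
```

With the sin/cos encoding, the last and first day of the week end up close
together, which a raw 0..6 number would not give. Each resulting list of
doubles could then become the value of one SequenceFile entry; as far as I
know, Mahout's clustering jobs read values of type VectorWritable (for example
wrapping a DenseVector) rather than TupleWritable, but please check this
against the Mahout version in use.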


2013/12/3 Angelo Immediata <angeloimm@gmail.com>

> Well, the similarity between data points should be calculated from the
> following variables: meteo, manifestation, day of the week, month of the
> year and vacation.
>
>
> 2013/12/3 Ted Dunning <ted.dunning@gmail.com>
>
>> The key first question is how you plan to compute similarity between data
>> points.  It isn't clear how you should do this with your data.
>>
>>
>>
>>
>> On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata <angeloimm@gmail.com> wrote:
>>
>> > Hi
>> >
>> > I'm fairly new to machine learning, and above all to Apache Mahout, so
>> > please pardon my basic questions.
>> >
>> > I need to do some cluster analysis on some data. At the beginning the data
>> > will not be very large, but over time it can become really huge (by my
>> > calculation, after 1 year it can reach around 37 billion records). Given
>> > this volume, I decided to do the cluster analysis with Mahout on top of
>> > Apache Hadoop and its HDFS. To store this large amount of data I decided to
>> > use Apache HBase, also on top of Apache Hadoop HDFS.
>> >
>> > Now I need to do this cluster analysis by considering some environment
>> > variables. These variables may be the following:
>> >
>> >    - *recordId* = id of the record
>> >    - *arcId* = id of the arc between 2 points of my "street graph"
>> >    - *mediumVelocity* = medium velocity of the considered arc in the
>> >    specified
>> >    - *vehiclesNumber* = number of vehicles monitored in order to get
>> >    that velocity
>> >    - *meteo* = weather condition (a numeric code representing whether
>> >    there is sun, rain, etc.)
>> >    - *manifestation* = a numeric code representing whether there is any
>> >    kind of manifestation (a sports event or other)
>> >    - *day of the week*
>> >    - *month of the year*
>> >    - *hour of the day*
>> >    - *vacation* = a numeric code representing whether it's a vacation day
>> >    or a working day
>> >
>> > So my data is formatted like this (raw representation):
>> >
>> > recordId  arcId  mediumVelocity  vehiclesNumber  meteo  manifestation  weekDay  yearMonth  dayHour  vacation
>> > 1         1      34.5            20              1      3              4        2011       10       3
>> > 2         156    66.5            3               2      5              1        2008       6        2
>> >
>> > As far as I know, in order to do the cluster analysis in Mahout I need to
>> > format my data in Mahout's input format (that is, a SequenceFile). The
>> > question is: how can I write the data shown in the table above to a
>> > SequenceFile? I tried to find an example but was not able to find any
>> > good sample. Any suggestion would be really appreciated.
>> >
>> > Thank you Angelo
>> >
>>
>
>
