mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Am I starting right with clustering ?
Date Wed, 03 Aug 2011 15:37:20 GMT
I think you are on the right track, but I have some suggestions:
- How many shops do you have in your DB? Unless you have billions of them, you can likely
run the sequential (-xm sequential) algorithms, which run locally and are much faster.
- You will want to produce NamedVectors from your database, with the shop_id as the name and
the category vector as the delegate. I'm not sure whether the Mahout ARFF converter will do this
for you. It may be simpler to write your own converter using org.apache.mahout.clustering.conversion.InputDriver/Mapper
as prototypes. These convert space-delimited files to Mahout Vectors but do not produce
NamedVectors. Nor do they produce a dictionary file, but your categories seem simple enough
to forgo that.
- Once you have created a directory of NamedVector sequence files, you should be able to cluster them
easily.
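
As a rough illustration of the space-delimited input mentioned above, here is a minimal Python sketch. It assumes the per-shop vectors are already in memory as a dict; the function name to_space_delimited and the "shop #2" entry are hypothetical and used only for illustration. Because the space-delimited form carries no names, the sketch returns the shop IDs as a separate list in the same order. Writing actual NamedVector sequence files would still have to happen on the Java side and is not shown.

```python
def to_space_delimited(vectors):
    """Return (names, lines): one space-delimited line per shop,
    with the shop IDs in the same order as the lines."""
    names = sorted(vectors)  # fixed order so names and lines stay aligned
    lines = [" ".join(str(v) for v in vectors[name]) for name in names]
    return names, lines


# "shop #2" is made-up sample data, not from the original question.
vectors = {"shop #1": [1, 10, 0, 20], "shop #2": [0, 3, 7, 0]}
names, lines = to_space_delimited(vectors)

for name, line in zip(names, lines):
    print(name, "->", line)
```

Keeping the names in a parallel list is only one way to preserve the shop_id; a custom converter that emits NamedVectors directly would make it unnecessary.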

Smooth sailing,
Jeff

-----Original Message-----
From: Clément Notin [mailto:clement.notin@gmail.com] 
Sent: Wednesday, August 03, 2011 7:03 AM
To: user@mahout.apache.org
Subject: Am I starting right with clustering ?

Hello,

I'm new to the Mahout world and it seems really nice, but it's hard to find
easy documentation :(

I'm trying to run some clustering. Let me explain what I'm trying to
achieve.
I have a DB with columns: shop_id (string), customer_category (string),
num_of_purchases (integer).
What I want to do is discover groups of shops that are related because
they have some customer categories in common.

I think the vectors should be:
"shop #1" = (1, 10, 0, 20)
which means that customer category A has bought 1 thing in the shop,
customer category B has bought 10 things in the shop, and so on.
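
A vector of this shape can be built by pivoting (shop_id, customer_category, num_of_purchases) rows. The following Python sketch is one possible way to do it, assuming the rows have already been fetched from the DB as tuples and that the full list of categories is known in advance. The function name pivot is hypothetical; the sample rows mirror the "shop #1" example in this message.

```python
from collections import defaultdict


def pivot(rows, categories):
    """Turn (shop_id, category, purchases) rows into one dense vector per shop."""
    index = {c: i for i, c in enumerate(categories)}  # category -> vector position
    vectors = defaultdict(lambda: [0] * len(categories))
    for shop_id, category, purchases in rows:
        vectors[shop_id][index[category]] += purchases
    return dict(vectors)


rows = [
    ("shop #1", "A", 1),
    ("shop #1", "B", 10),
    ("shop #1", "D", 20),
]
vectors = pivot(rows, ["A", "B", "C", "D"])
print(vectors)  # {'shop #1': [1, 10, 0, 20]}
```

Categories with no purchases for a shop (here C) stay at 0, which matches the vector above.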

In my DB, for this example, I have:
shop_id   | customer_category | num_of_purchases
----------+-------------------+-----------------
"shop #1" | "A"               | 1
"shop #1" | "B"               | 10
"shop #1" | "D"               | 20


I think I must convert this to an ARFF file like:

@RELATION purchases
@ATTRIBUTE shop_id STRING
@ATTRIBUTE catA NUMERIC
@ATTRIBUTE catB NUMERIC
@ATTRIBUTE catC NUMERIC
@ATTRIBUTE catD NUMERIC

@DATA
"shop #1",1,10,0,20
...

Why an ARFF file? Because I can use the helpful sparse syntax.
But this file is difficult to build. I think I should write a script.
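
Such a script could look roughly like the Python sketch below. It assumes the per-shop vectors have already been computed and emits rows in the sparse ARFF form {index value, ...}, with 0-based attribute indices and zero values omitted. The function name to_sparse_arff is hypothetical, shop IDs containing double quotes are not escaped, and whether Mahout's ARFF converter accepts sparse rows with a STRING attribute has not been verified here.

```python
def to_sparse_arff(vectors, categories, relation="purchases"):
    """Render per-shop vectors as an ARFF document with sparse data rows."""
    lines = [f"@RELATION {relation}", "@ATTRIBUTE shop_id STRING"]
    lines += [f"@ATTRIBUTE cat{c} NUMERIC" for c in categories]
    lines += ["", "@DATA"]
    for shop_id, values in vectors.items():
        # Attribute 0 is shop_id; category attributes start at index 1.
        cells = [f'0 "{shop_id}"']
        cells += [f"{i} {v}" for i, v in enumerate(values, start=1) if v != 0]
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines)


arff = to_sparse_arff({"shop #1": [1, 10, 0, 20]}, ["A", "B", "C", "D"])
print(arff)
```

For the "shop #1" example, the single data row comes out as {0 "shop #1", 1 1, 2 10, 4 20}; category C is omitted because its count is 0.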


My question is: am I heading in the right direction?
I would appreciate some help! Thanks :)

Regards,

-- 
Clément Notin