mahout-user mailing list archives

From Sameer Tilak <>
Subject RE: Mahout for clustering
Date Tue, 03 Dec 2013 00:30:33 GMT
I am looking for some input on how to vectorize my data. 

> From:
> To:
> Subject: Mahout for clustering
> Date: Mon, 2 Dec 2013 16:22:03 -0800
> Hi All,
>
> We are using Apache Pig to build our data pipeline. We have data in the following format:
> userid, age, items {code 1, code 2, ...}, a few other features...
> Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data.
> Based on my reading so far, I see the following options:
> 1. Map each alphanumeric item code to a numeric code -- AAAAA1 -> 0, AAAAA2 -> 1,
> AAAAA3 -> 2, etc. Then run the clustering algorithm on the reformatted data and map
> the results back onto the real item codes.
> 2. Represent the item codes as a 1 x M matrix in which each column represents an item
> (1 if a given user has viewed that item, 0 otherwise). It will have millions of
> columns. So each user will have an id, an age, and this matrix. I am not sure whether
> this will work.
> We also want to do frequent pattern mining etc. on the same data. Any thoughts on data
> representation and clustering would be great.
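
To make the vectorization question concrete, here is a minimal sketch of how the two options in the quoted message could be combined. It assumes the integer from option 1 is used only as a column index, not as a feature value, since treating the integer as a magnitude would imply an ordering between item codes that does not exist. Each user is then stored sparsely, keeping only the non-zero columns, so millions of possible items do not mean millions of stored entries per user. The function names, the sample records, and the choice to put age in column 0 are all hypothetical and only for illustration. The sketch is plain Python rather than Mahout code; Mahout has sparse vector classes (for example RandomAccessSparseVector) that play the role of the dict used here.

```python
def build_item_index(records):
    """Assign each distinct item code a column index, in first-seen order."""
    index = {}
    for _userid, _age, items in records:
        for code in items:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_vector(age, items, item_index):
    """Build a sparse vector as {column: value}.

    Column 0 holds the age (an assumed layout); column 1 + item_index[code]
    is 1.0 when the user has viewed that item. Columns that are absent from
    the dict are implicitly 0.
    """
    vec = {0: float(age)}
    for code in items:
        vec[1 + item_index[code]] = 1.0
    return vec


# Hypothetical sample records in the shape (userid, age, item codes).
records = [
    ("u1", 34, ["AAAAA1", "AAAAA2"]),
    ("u2", 27, ["AAAAA2", "AAAAA3"]),
]

item_index = build_item_index(records)
vectors = {
    userid: to_sparse_vector(age, items, item_index)
    for userid, age, items in records
}

# Inverse mapping, to translate column indices back to the real item codes.
index_to_code = {i: code for code, i in item_index.items()}
```

With a Euclidean-style distance, an unscaled age column is likely to dominate the 0/1 item columns, so scaling age (or choosing a distance measure suited to sparse binary data) is probably worth considering before clustering. The same item lists, kept as one set of codes per user, are also the usual input shape for frequent pattern mining.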