mahout-user mailing list archives

From Sameer Tilak <>
Subject Mahout for clustering
Date Tue, 03 Dec 2013 00:22:03 GMT

Hi All,

We are using Apache Pig to build our data pipeline. Our data has the following format:

userid, age, items {code 1, code 2, ….}, a few other features...

Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data.
Based on my reading so far, I see the following options:
1. Map each alphanumeric item code to a numeric code (AAAAA1 -> 0, AAAAA2 -> 1, AAAAA3 -> 2,
etc.), run the clustering algorithm on the reformatted data, and then map the results back
onto the real item codes.
2. Represent the item codes as a 1 x M matrix in which each column represents an item (1 if
the user has viewed that item, 0 otherwise). M will be in the millions. Each user would then
have an id, an age, and this matrix. I am not sure whether this will work.
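
To make the two options concrete, here is a minimal sketch in plain Python (not Mahout or Pig code). The sample records and the helper names build_item_index and to_sparse_vector are hypothetical, invented only for illustration. It suggests that the options can be combined: the code-to-index map from option 1 supplies the column numbers for option 2, and storing only the columns that are 1 avoids materializing millions of zeros per user.

```python
def build_item_index(users):
    """Assign each distinct item code a dense integer index, in first-seen order."""
    index = {}
    for _userid, _age, items in users:
        for code in items:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_vector(items, index):
    """Return the sorted column indices at which the user's binary row is 1."""
    return sorted({index[code] for code in items})


# Hypothetical sample records: (userid, age, item codes)
users = [
    ("u1", 34, ["AAAAA1", "AAAAA3"]),
    ("u2", 27, ["AAAAA2", "AAAAA3", "AAAAA1"]),
]

index = build_item_index(users)

# Reverse map, used to translate clustering results back to the real item codes.
reverse_index = {i: code for code, i in index.items()}

# One sparse binary row per user; every column not listed is implicitly 0.
vectors = {uid: to_sparse_vector(items, index) for uid, _age, items in users}
```

Note that the integer codes here are only column positions. Feeding them to a distance-based clustering algorithm as if they were numeric feature values would imply an ordering among items that does not exist.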

We also want to do frequent pattern mining etc. on the same data. Any thoughts on data
representation and clustering would be great.
