mahout-user mailing list archives

From Abhishek Agarwal
Subject Re: K-Means matrix
Date Wed, 25 Feb 2009 11:07:25 GMT
I am no Mahout expert, but I do know MR, CF and ML.
I think a sparse-matrix implementation of K-Means will make things
much simpler and more manageable.
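
To illustrate what "sparse" means here, below is a minimal sketch in Python. It assumes the input is a list of (user id, site id) visit records; the names `visits` and `to_sparse_rows` and the toy data are made up for illustration and are not part of Mahout. Each user keeps only the sites actually visited, so the all-zero cells of the matrix are never stored.

```python
from collections import defaultdict

# Hypothetical (user_id, site_id) visit records; real input would come from the web logs.
visits = [
    ("uid1", "site1"), ("uid1", "site3"),
    ("uid2", "site2"), ("uid2", "site3"),
    ("uid3", "site1"),
]

def to_sparse_rows(records):
    """Group visits by user, keeping only the sites each user actually visited."""
    rows = defaultdict(set)
    for user_id, site_id in records:
        rows[user_id].add(site_id)
    return dict(rows)

rows = to_sparse_rows(visits)

# Compare the number of stored entries with the size of the full dense matrix.
stored = sum(len(sites) for sites in rows.values())
dense = len(rows) * len({site_id for _, site_id in visits})
print(stored, "stored entries instead of", dense, "dense cells")
```

With tens of thousands of sites and only a small fraction visited per user, the gap between the stored entries and the dense cell count grows accordingly.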

Rough calculation:
~1M unique users
~1000 unique sites a user visits in a month
~1G entries (1M users x 1000 sites)
~4 bytes per entry
~4 GB of data
~40 GB with x10 overhead
Distribute the data to multiple machines, grouped by user id.
K-Means is a perfect algorithm to implement in the MR paradigm.
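
As a rough sketch of why K-Means fits the MR paradigm, one iteration can be written as a map step (assign each user's row to its nearest centroid) and a reduce step (average the rows assigned to each centroid). The Python below is a single-machine illustration under assumed names (`sq_distance`, `map_assign`, `reduce_mean`, `kmeans_iteration`) with toy data; it is not Mahout's implementation. Rows are sets of visited site ids, and centroids are dicts mapping site id to weight.

```python
from collections import defaultdict

def sq_distance(row, centroid):
    """Squared Euclidean distance between a binary sparse row and a sparse centroid.

    For a 0/1 row x and centroid c:
    ||x - c||^2 = ||c||^2 + sum over visited sites j of (1 - 2 * c_j)
    """
    norm_c = sum(w * w for w in centroid.values())
    return norm_c + sum(1.0 - 2.0 * centroid.get(site, 0.0) for site in row)

def map_assign(user_id, row, centroids):
    """Map step: emit (index of the nearest centroid, row)."""
    nearest = min(range(len(centroids)),
                  key=lambda k: sq_distance(row, centroids[k]))
    return nearest, row

def reduce_mean(rows):
    """Reduce step: average all rows assigned to one centroid."""
    totals = defaultdict(float)
    for row in rows:
        for site in row:
            totals[site] += 1.0
    return {site: count / len(rows) for site, count in totals.items()}

def kmeans_iteration(data, centroids):
    """One K-Means iteration; a centroid with no assigned rows is kept unchanged."""
    groups = defaultdict(list)
    for user_id, row in data.items():
        k, emitted = map_assign(user_id, row, centroids)
        groups[k].append(emitted)
    return [reduce_mean(groups[k]) if groups[k] else centroids[k]
            for k in range(len(centroids))]

# Toy data: each user id maps to the set of sites that user visited.
data = {
    "uid1": {"site1", "site3"},
    "uid2": {"site2", "site3"},
    "uid3": {"site1"},
}
initial = [{"site1": 1.0}, {"site2": 1.0}]
updated = kmeans_iteration(data, initial)
print(updated)
```

In a real MR job the map calls would run in parallel over the user-partitioned data, and the iteration would be repeated until the centroids stop changing.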

I don't think creating the matrix and clustering it will take more than
a couple of hours.
I have done similar experiments at a much larger scale, and it works. Let me
know if you have any issues.


On Wed, Feb 18, 2009 at 12:13 AM, Marcus Herou wrote:

> Hi.
> I've been visiting some mailing lists looking for directions on how to
> address a quite cool problem. The answer so far has been: "Look at K-Means
> clustering". Since I'm quite familiar with Hadoop, which is incorporated in
> both our crawler and our webstats engine, it seemed that the right gang
> to turn to was the Mahout users.
> Basically we have 40k sites in our network, for which we track web log stats
> such as unique browsers, page impressions, sessions, etc.
> We want to learn more about our network, so I've started to develop a
> solution which would create a similarity matrix so you could say:
> These 10 sites are most similar to Site X in terms of visiting patterns,
> i.e. same kind of audience.
> We have one big problem though... It will take 5 years to compute this
> matrix at the current implementation speed :) That's why I'm starting to
> look elsewhere.
> The matrix is really simple; below is an example:
>        site1  site2  site3  ....
> uid1     X             X
> uid2            X      X
> uid3     X
> ....
> or, table-wise, it could be something like
> CREATE TABLE UniqueSiteVisitorSample(
> s1 bit,
> s2 bit,
> s3 bit,
> ....
> uid bigint
> )
> where the X (bit set) means that one visitor visited a specific site, so
> two sites with many common "X's" are similar...
> As I said, a _very_ simple data structure, but large, and I don't know of any
> storage mechanism where you could store 40k+ columns...
> Would it be feasible to use Mahout to create some output which stated how
> similar SiteX is to SiteA, SiteB, etc.?
> If it takes some hours (or days) to compute then it's quite OK from my
> standpoint since the matrix could be recreated every X weeks or so.
> I hope someone here on the list thinks, "Man, this guy is stupid, he should do
> it like this!" :)
> /Marcus
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312

Abhishek Agarwal
