Hi.
I've been visiting some mailing lists trying to find directions on how to
address a quite cool problem. The answer so far has been "Look at KMeans
clustering". Since I'm quite familiar with Hadoop, which is incorporated in
both our crawler and our webstats engine, it seemed that the right gang
to turn to was the Mahout users.
Basically, we have 40k sites in our network, for which we track weblog stats
such as unique browsers, page impressions, sessions, etc.
We want to learn more about our network, so I've started to develop a
solution that creates a similarity matrix, so that you could say:
these 10 sites are most similar to Site X in terms of visiting patterns,
i.e. they have the same kind of audience.
We have one big problem though... It will take 5 years to compute this
matrix at the current implementation speed :) That's why I'm starting to
look elsewhere.
The matrix is really simple; below is an example:
site1 site2 site3....
uid1 X X
uid2 X X
uid3 X
....
or, table-wise, it could be something like:
CREATE TABLE UniqueSiteVisitorSample(
s1 bit,
s2 bit,
s3 bit,
....
uid bigint
)
Here an X (bit set) means that a visitor visited a specific site, so two
sites with many X's in common are similar...
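
To make the "many X's in common" idea concrete, here is a minimal sketch in
Python. It assumes Jaccard overlap (shared visitors divided by all distinct
visitors of either site) as the similarity measure; that choice is only an
assumption for illustration, and the names jaccard and visitors_by_site and
the sample data are hypothetical.

```python
def jaccard(a, b):
    """Shared visitors divided by all distinct visitors of either site."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical sample: each site maps to the set of uids with an X in its column.
visitors_by_site = {
    "site1": {1, 2},
    "site2": {1, 2, 3},
    "site3": {3},
}

# site1 and site2 share two of three distinct visitors; site1 and site3 share none.
sim_1_2 = jaccard(visitors_by_site["site1"], visitors_by_site["site2"])
sim_1_3 = jaccard(visitors_by_site["site1"], visitors_by_site["site3"])
```

Other measures (raw co-visit counts, cosine, log-likelihood) would fit the
same shape; only the body of the function would change.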
As I said, it is a _very_ simple data structure, but large, and I don't know
of any storage mechanism where you could store 40k+ columns...
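
One possible way around the 40k+ columns, sketched here under my own
assumptions, is to store only the set bits as (uid, site) pairs instead of
the full matrix. Co-visit counts can then be accumulated per visitor rather
than per site pair. The function names and sample data are hypothetical, and
this is a single-machine illustration, not a Mahout or Hadoop job.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_visit_counts(pairs):
    """Count, for every pair of sites, how many visitors visited both."""
    sites_by_uid = defaultdict(set)
    for uid, site in pairs:
        sites_by_uid[uid].add(site)
    counts = Counter()
    for sites in sites_by_uid.values():
        # Sort so each site pair is always keyed in the same order.
        for a, b in combinations(sorted(sites), 2):
            counts[(a, b)] += 1
    return counts

def most_similar(counts, site, n=10):
    """Return up to n (other_site, shared_visitors) tuples, highest count first."""
    scores = Counter()
    for (a, b), shared in counts.items():
        if a == site:
            scores[b] = shared
        elif b == site:
            scores[a] = shared
    return scores.most_common(n)

# Hypothetical sample of set bits: (uid, site) means the visitor visited the site.
pairs = [
    (1, "site1"), (1, "site2"),
    (2, "site1"), (2, "site2"), (2, "site3"),
    (3, "site3"),
]
counts = co_visit_counts(pairs)
top = most_similar(counts, "site1")
```

With this layout the work grows with the number of sites each visitor
touched rather than with the square of the number of sites, although a
visitor who touched very many sites would still be expensive.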
Would it be feasible to use Mahout to create some output that states how
similar SiteX is to SiteA, SiteB, etc.?
If it takes some hours (or days) to compute, that's quite OK from my
standpoint, since the matrix could be recreated every X weeks or so.
I hope someone here on the list thinks, "Man, this guy is stupid, he should
do it like this!" :)
/Marcus

Marcus Herou CTO and cofounder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
