HBase is the distributed database that's been developed in conjunction
with Hadoop so it may be worth looking at that for your storage needs
 unfortunately I know nothing about it! However, what you're
suggesting could be achieved simply by storing the data as plain text
(I'm a fan of plain text files for no apparent reason....) like this:
uid [, 1, 0, 0, 0, 1, 0, 1, 0, 0, .......,]
uid [, 1, 0, 1, 0, 1, 1, 1, 0, 0, .......,]
uid [, 0, 0, 0, 0, 0, 0, 1, 0, 0, .......,]
...
...
where a 1 is an X in your example (the commas and square brackets are
how dense vectors are formatted in Mahout). If the rows contain lots
of zeros and not many ones, it would be worth using sparse vectors
instead as Abhishek said (for which there is an implementation in
Mahout :) ).
Sounds like you then just want to measure site similarity using the
Hamming distance between two vectors (users). Actually if you took the
Hamming distance between two vectors as described above that would
give you similarity between two users... so it'd be better for your
needs to construct the vectors with the user ids as the columns and
the document ids as the rows.
KMeans clustering will group the sites into K distinct clusters, then
I guess you could just use the distances between the cluster centres
rather than looking up each individual site, using KMeans almost like
lossy data compression. Does that make sense...?
May also be worth looking into Canopy Clustering. Just a note about
the clustering implementations  I think I'm right in saying that the
current implementations in Mahout don't cater for labelled data points
which you'll need (the uid or siteid at the start of the lines for
each vector). I'm currently working on a text clustering example which
will hopefully help with this.
Hope that helps, and please let me know if I'm talking rubbish (that
goes for everyone...)!
Richard
2009/2/25 Abhishek Agarwal <abhishek.devil@gmail.com>:
> I am no Mahout expert, but do knw MR, CF and ML.
> I feel if you do a sparse matrix implementation of K Means things will be
> much simpler and manageable.
>
> rough calculation:
> ~1M unique users
> ~1000 unique sites a user visits in a month
> ~1G entries
> ~ 4 bytes per entry
> ~ 4G Data
> ~ x10(overheads) 40G
> ~ Distribute data to multiple machines grouped by user id.
>  KMeans is a perfect algo to be implemented in MR paradigm.
>
> I dont think creation and clustering is gonna take any thing more than
> couple of hours.
> I have done similar experiments at much larger scale, and it works. Let me
> know if you have any issues.
>
> Abhishek
>
> On Wed, Feb 18, 2009 at 12:13 AM, Marcus Herou
> <marcus.herou@tailsweep.com>wrote:
>
>> Hi.
>>
>> Been visiting some mailinglists trying to find directions for howto find
>> address a quite cool problem. The answer have so far been. "Look at KMeans
>> clustering". Since I'm quite familiar with Hadoop which is incorporated in
>> both our crawler and into our webstats engine it seemed that the right gang
>> to turn to was the Mahout users.
>>
>> Basically we have 40k sites in our network of which we track weblogstats
>> like unique browsers, page impressions and sessions etc.
>> We want to learn more about our network so I've started to develop a
>> solution which would create a similarity matrix so you could say:
>>
>> These 10 sites are most similar to Site X in terms of visiting patterns
>> i.e.
>> same kind of audience.
>>
>> We have one big problem though... It will take 5 years to compute this
>> matrix at the current implementation speed :) That's why I'm starting to
>> look elsewhere.
>>
>> The matrix is really simple and below is an example
>> site1 site2 site3....
>>
>> uid1 X X
>> uid2 X X
>> uid3 X
>> ....
>>
>> or table wise could be something like
>> CREATE TABLE UniqueSiteVisitorSample(
>> s1 bit,
>> s2 bit,
>> s3 bit,
>> ....
>> uid bigint
>> )
>>
>> Where the X (bit set) means that one visitor visited a specific site, so
>> two
>> sites with many common "X's" is similar...
>> As I said _very_ simple datastructure but large and I don't know of any
>> storage mechanism where you could store 40k+ columns...
>>
>> Would it be feasible to use Mahout to create some output which stated how
>> similar SiteX is with SiteA,SiteB etc ?
>>
>> If it takes some hours (or days) to compute then it's quite OK from my
>> standpoint since the matrix could be recreated every X weeks or so.
>>
>> Hope anyone here at the list thinks, "Man this guy is stupid, he should do
>> it like this!":)
>>
>> /Marcus
>>
>> 
>> Marcus Herou CTO and cofounder Tailsweep AB
>> +46702561312
>> marcus.herou@tailsweep.com
>> http://www.tailsweep.com/
>> http://blogg.tailsweep.com/
>>
>
>
>
> 
> cheers,
> Abhishek Agarwal
> www.it.iitb.ac.in/~abhisheka
>
