hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wilm Schumacher <wilm.schumac...@gmail.com>
Subject Re: Design a datastore maintaining historical view of users.
Date Tue, 13 Jan 2015 03:06:19 GMT

I'm doing something comparable right now, but not with such a HUGE
database O_o. 10 Mio results for such a query? This would mean that you
have 100 Mio -> 1 Billion customers ?!?!

However: in my opinion with such a huge database HBase is a good fit.
However, your data model should be changed for that purpose.

You could use something like
<userid>-<item_id> (or the other way around)
for a row key and
data:clicktime => <timestamp>
or whatever for data ....

for the "item table". In this example the "e-mail" table does not have
to be touched (I assume they are separated)

1.) Map reduce

Your problem make a good point for map reduce. Just map over the e-mail
events, when you find an e-mail that fits your question, make a get on
the user which is connected to the e-mail and the item_id (row key
design above). If so => emit and use an hbase table as sink. By that you
would make sure that the user is only processed once (as the e-mail
could be opened twice or so) and you would have a nice list in a
separate table which you could use.

For such a huge database the question of 30 min or less then is "just" a
question of money for hbase/yarn nodes.

2.) Simple index

I thought of simple indexes, too. However, this only would work properly
if you would make a table where the item_id would be the primary key for
the index and the users browse events would make up the column family.

with <ts> the specific timestamp of the action something like
item_1    user:alice <ts>    user:bob <ts>
item_2    user:alice <ts>                              user:charles <ts>

(hope the e-mail is rendered okay to explain what I mean)

Then you could simply fetch the <item_id> row and use that to check if
the e-mail action exist.

However, this index would be a table of <number of items> rows and
<number of users> qualifiers in one column family. Of course it would be
sparse, but would make up by above estimate 100 Mio -> 1 Billion
qualifiers in the main column family. I never ran a table THAT wide.
Does anyone on this list did something like that and can comment?

Furthermore you would have to iterate over the result ... which has to
be estimated >> 10 Mio items, which sounds like a pretty bad idea.

3.) Fancy index

You make up an index for exactly your question from above.

Row keys would be
where <e-mail-action> is "click", "erase", ....

with data
email:1 => <rowkey of the email event>
email:2 => <rowkey of the email event>
(as some events could occure more than once)

item:1 => <rowkey of the item browse event>
item:2 => <rowkey of the item browse event>

Then you would have your list right away e.g. by setting up a scan for
the specific range (which is the answer to your specific question in
this case).

But you would have to make on every e-mail event a scan over the list of
items the user has browsed (by design above a simple scan in the "item
table" for the specific user) and fill the index.

Which actually sounds like an even worse idea.


But as I read your e-mail, this example is just one of the query of many
different queries, right? And for every query you need one such index.

Normally I'm in favour of indexes, but on this case I think you really
need a clever row key design of your separate tables and the full power
of map reduce.

However, I saw your analogue postings on the other "nosql" mailing
lists. As cassandra have a "join" feature in the cql this would be a
good fit ... at first sight. Furthermore the querying for ElasticSearch
can easily be used in a way which satisfy your need. However, by design
(as I understand what c* is doing on joins) cassandra would do A LOT of
communication for that case (i.e. inefficient), and elastic search will
probably break down if the master would have to give you 10 Mio results
back. Thus you would have to iterate A LOT of pages to get your full
result, which means a lot of JSON parsing and http stuff.

Even if hbase is not the best choice on the first sight, I think it
could be the best choice nevertheless.

Best wishes,


ps: man ... that's a lot of data. Is that you, amazon? ;)

Am 13.01.2015 um 01:42 schrieb Chen Wang:
> Hey Guys,
> I am seeking advice on design a system that maintains a historical view of
> a user's activities in past one year. Each user can have different
> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
> The query I would like to do is, for example,
> Find all customers who browse item A in the past 6 month, and also clicked
> an email.
> and I would like the query to be done in reasonable time frame. (for
> example, within 30 minutes to retrieve 10million such users)
> Since we already have HBase cluster in place, HBase becomes my first
> choice. So I can have customer_id as the row key, column family be
> 'Activity', then have certain attributes associated with the column
> family,something like:
> custer_id, browse:{item_id:12334, timestamp:epoc}
> However, It seems that HBase would not be a good choice for supporting the
> queries above. Even its possible with scan, it will be super inefficient
> due to the size of the data set.
> Is my understanding correct and I should resort to other data store.(ES in
> my opinion). or has anyone done similar thing with HBase?
> Thanks in advance.
> Chen

View raw message