hbase-user mailing list archives

From Joe Stein <charmal...@allthingshadoop.com>
Subject many tables vs long rows
Date Tue, 03 Jan 2012 14:39:06 GMT
Hey folks, I am prototyping with HBase to handle our de-duplication needs
using checkAndPut (right now I do it with map/reduce, and I have built a more
real-time system for which this is the last piece to polish off). I have a
few questions/thoughts I wanted to bounce around and get some feedback on,
please. Thanks!
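
For context, the checkAndPut path I am prototyping looks roughly like this
(just a sketch; the "events" table, "d" family, and "seen" qualifier are
placeholder names, not our real schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");      // placeholder table name

    byte[] row = Bytes.toBytes(args[0]);             // row key, e.g. a hash of the event
    Put put = new Put(row);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));

    // checkAndPut writes only if the "seen" cell is still null, so the
    // boolean result tells me whether this event is new or a duplicate.
    boolean isNew = table.checkAndPut(row, Bytes.toBytes("d"),
        Bytes.toBytes("seen"), null, put);
    System.out.println(isNew ? "new event" : "duplicate");

    table.close();
  }
}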

So, first, I want to be able to delete rows that are older than a time
period (like 6 months trailing).  The issue here is I don't think I can use
TTL (unless I can override the timestamp on insert, and even then I'm not
sure it's a good idea to have billions of rows getting deleted by TTL each
day).
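
If I did go the TTL route, I think it would mean setting a roughly 6-month
TTL on the column family and writing each cell with the event's own
timestamp rather than the insert time, something like this (again just a
sketch with placeholder names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class EventTimePut {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");        // placeholder table name

    // The family would carry the TTL, e.g. set at table creation with
    // HColumnDescriptor.setTimeToLive(6 * 30 * 24 * 60 * 60) (seconds).
    long eventTs = Long.parseLong(args[1]);            // device timestamp, maybe months old
    Put put = new Put(Bytes.toBytes(args[0]), eventTs); // override the cell timestamp with event time
    put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), eventTs, Bytes.toBytes(1L));
    table.put(put);

    table.close();
  }
}

The part I'm unsure about is exactly what I said above: whether it's wise to
lean on TTL for billions of deletes a day, and a late-arriving cell already
older than the TTL would be eligible for expiry as soon as it lands.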

Our system is asynchronous and we store billions of pieces of data per day.
In such a system I could receive data from a mobile device today with a
timestamp from November (or whatever), because now is when the user
connected to the internet and used the app again, so only now am I receiving
the data from the last time they used it while offline.

So one thought I had was a table for each day; that way I could delete
whenever I wanted to... but this seems like a bit of a nightmare. Maybe by
month? Or week? Week feels better...
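
The appeal of a table per week is that expiring old data becomes just
disabling and dropping the oldest table, roughly like this (the weekly
naming scheme is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropOldWeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Drop the weekly table that has fallen out of the 6-month window,
    // e.g. "events_2011w27" (made-up naming scheme).
    String oldTable = args[0];
    if (admin.tableExists(oldTable)) {
      admin.disableTable(oldTable);
      admin.deleteTable(oldTable);
    }
  }
}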

I guess I am also a little worried about having trillions of rows in one
table, but maybe that is not an issue?  Just dumping everything into one
mega table does not feel right.

So far my load tests are going well, but there is a lot more to go. I am
thinking of turning on bloom filters (I already have compression on) since I
will have lots of misses (most of the data, 90%+, is NOT duplicate but
real), plus a bunch of other things I am learning as I go, trying to iterate
with each change to our de-duplication code.  I have been really happy and
impressed with HBase so far; great job everyone, and thanks!
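
For reference, what I have in mind for the bloom filters is just setting
them at the column family level when the table is created, something like
the sketch below (placeholder names again, and GZ stands in for whatever
compression codec we actually use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class CreateDedupTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("events");   // placeholder table name
    HColumnDescriptor family = new HColumnDescriptor("d");    // placeholder family name

    // ROW bloom filters should help the read side of checkAndPut when most
    // lookups are misses; compression stays on as before.
    family.setBloomFilterType(StoreFile.BloomType.ROW);
    family.setCompressionType(Compression.Algorithm.GZ);

    desc.addFamily(family);
    admin.createTable(desc);
  }
}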

I guess my next step may just end up being to jump into the code so I can
get a better sense of these things, but I would appreciate any help, either
with my questions or with pointers through the code (being on the east coast
I feel thousands of miles away from the action and the meetups and all the
rest, but I look forward to getting more into things).

Regards

-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop <http://twitter.com/#!/allthingshadoop>
*/
