hbase-user mailing list archives

From Joe Stein <charmal...@allthingshadoop.com>
Subject Re: many tables vs long rows
Date Tue, 03 Jan 2012 20:38:30 GMT
Inline updates/follow-up below; thanks for the detailed response.

On Tue, Jan 3, 2012 at 3:14 PM, Stack <stack@duboce.net> wrote:

> On Tue, Jan 3, 2012 at 6:39 AM, Joe Stein
> <charmalloc@allthingshadoop.com> wrote:
> > So, first I want to be able to delete rows that are older than a time
> > period (like 6 months trailing).  The issue here is I don't think I can
> > use TTL (unless I can override the timestamp on insert, and even if I
> > did, I'm not sure it is good for billions of rows to get deleted by TTL
> > each day).
> >
> TTL check happens (mostly) when you major compact so you can control
> it somewhat.
> There is a difference between a TTL and an explicit delete.  With the
> former, older cells are just dropped at compact time.  With the
> latter, a new delete record is added and at query time it's acted on.
> There are also different kinds of deletes in that there are explicit
> deletes of explicit cells (a new entry in hbase per cell to be
> deleted) and a column family delete which is a single entry at the
> start of a row for the deleted column family.
> I raise the above so you see that doing explicit deletes 'costs' more
> than TTL'ing.
cool, thanks.
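For my own notes, here is a rough sketch of the two options against the
0.90 Java API (the "events" table and "d" family are made-up names, and I
have not run this exact code):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TtlVsDelete {
      public static void main(String[] args) throws IOException {
        // Option 1: TTL on the column family. Expired cells are simply
        // dropped at (major) compaction time -- no delete markers.
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("events");
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setTimeToLive(180 * 24 * 60 * 60); // seconds; roughly 6 months
        desc.addFamily(cf);
        admin.createTable(desc);

        // Option 2: explicit deletes. Each call writes delete marker(s)
        // that readers must honor until a compaction cleans them up.
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        Delete del = new Delete(Bytes.toBytes("some-row-key"));
        del.deleteFamily(Bytes.toBytes("d")); // one marker at row start
        table.delete(del);
      }
    }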

> > Our system is asynchronous and we store billions of pieces of data per
> > day, and in such a system I could receive data from a mobile device
> > today with a timestamp from November (or whatever), because now is when
> > the user connected to the internet; I am receiving data from the last
> > time they used the app, when they were not connected to the internet.
> >
> You want to keep the cell for 6 months since you 'saw' it -- if so,
> you could TTL it? -- or for 6 months after the event happened (for the
> latter, the timestamp would be the event timestamp).
When the event happened: so if we see something from November 3rd today,
then we will only keep it for 4 more months (and events that we see today
stay for 6 months).  So it sounds like this might be a viable option, and
when we set the timestamp in our checkAndPut we make the timestamp the
value that represents November 3rd, right? Cool.
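Something like this is what I have in mind (a sketch against the 0.90 API;
the row key, family, and payload are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventTimePut {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        byte[] row = Bytes.toBytes("device123-event456"); // made-up key
        long eventTs = 1320278400000L; // the event time (Nov 3rd), not now

        // Stamp the cell with the event time so a family TTL counts down
        // from when the event happened, not from when we received it.
        Put put = new Put(row);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), eventTs,
            Bytes.toBytes("...event data..."));

        // Dedup: only write if the cell does not exist yet (an expected
        // value of null means "no current value").
        boolean fresh = table.checkAndPut(row, Bytes.toBytes("d"),
            Bytes.toBytes("payload"), null, put);
        System.out.println(fresh ? "stored" : "duplicate, skipped");
      }
    }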

> > So one thought I had was a table for each day; this way I could delete
> > whenever I wanted to... this seems like a bit of a nightmare. Maybe by
> > month? Or week? Week feels better....
> >
> You could do that but sounds like the table-per-month would have data
> from outside of the month?  You'd be ok w/ this?   You'd need to
> figure how to do the x-months view.

Well, what I was thinking is that my client code would know to use the
November table and put the data in the November table (it is all just
strings), but I am leaning now towards the TTL option (need to futz with it
all more though).  One issue/concern with TTL is if all of a sudden we
want to keep things for only 4 months, or maybe 8 months, and then have to
re-TTL trillions of rows =8^( (which is a nagging thought in the back of my
head about TTLs; requirements change)....  Going the weekly route seems
viable too: I can figure out what week the event occurred in, save it to
that week's table, and then keep the last 30 weeks trailing (or
whatever...); we can also just say no, only moving forward (or cross that
bridge when we come to it).
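If we do go weekly, the routing could be as simple as something like this
(just a sketch, and the table naming scheme is hypothetical); expiring a
week then becomes one disable+drop of a table instead of trillions of
deletes:

    import java.util.Calendar;
    import java.util.TimeZone;

    public class WeeklyTables {
      // Hypothetical helper: pick a table name from the event timestamp,
      // e.g. an event from early November 2011 lands in "events_2011wNN".
      static String tableForEvent(long eventTs) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(eventTs);
        // Note: WEEK_OF_YEAR is fiddly around year boundaries and varies
        // by locale; a real version would pin the boundary rules down.
        return String.format("events_%dw%02d",
            cal.get(Calendar.YEAR), cal.get(Calendar.WEEK_OF_YEAR));
      }

      public static void main(String[] args) {
        System.out.println(tableForEvent(System.currentTimeMillis()));
      }
    }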

> > I guess I am also a little worried about having trillions of rows in a
> > table but maybe that is not an issue????  just dumping everything in one
> > mega table just does not feel right.
> >
> HBase deals in regions; it doesn't care if they are of one table or many.

That makes sense; I see why a narrow, long schema works well then, got it.
I am used to Cassandra and do lots of wide-column range slices on those
columns, so this is like inverting everything on myself, but the row locks
and checkAndPut (and co-processors) hit so many of my use cases (as
Cassandra still does...).

> > So far my load tests are going well but there is a lot more to go. I am
> > thinking of turning on bloom filters (already have compression on) as I
> > will have lots of misses (most of the data, 90%+, is NOT duplicate but
> > real), plus a bunch of other things I am learning as I go, trying to
> > iterate with each change to our de-duplication code.  I have been really
> > happy and impressed so far with HBase, great job everyone and thanks!
> >
> I'd say don't do blooms till you have 0.92 up on your cluster (Are you
> 0.92'ing it or 0.90?).  They've been much improved in 0.92.

Right now I am on 0.90.4, but I am going back and forth on changing our
Hadoop cluster; HBase is the primary driver for that, so I am currently
wrestling with the decision of upgrading the existing cluster from CDH2 to
CDH3, or going with MapR... My preference is to run my own version of
HBase (like I do with Kafka and Cassandra); I feel I can do this, though I
am not comfortable running my own Hadoop build (already overloaded with
things).  0.92 is exciting for co-processors too, and it is a cool system
to hack on; maybe I will pull from trunk, build, and test it out some too.
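When we do get to 0.92, my understanding is that turning row blooms on per
column family looks roughly like this (a sketch against the 0.92 API,
untested, with made-up names):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class RowBlooms {
      public static void main(String[] args) throws IOException {
        // ROW blooms let a get skip storefiles that cannot contain the
        // row, which fits the lots-of-misses de-duplication pattern.
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("events");
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBloomFilterType(StoreFile.BloomType.ROW);
        desc.addFamily(cf);
        admin.createTable(desc);
        // For an existing table it is a disable / alter / enable cycle,
        // and blooms only apply to storefiles written after the change.
      }
    }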

> > I guess my next step may just end up being to jump into the code so I
> > can get a better sense of these things, but I'd appreciate any help,
> > either with my questions or with pointers through the code (being on
> > the east coast I feel thousands of miles away from the action and the
> > meetups and the rest, but I look forward to getting more into things).
> >
> Good on you Joe (You saw that I asked for your wiki name so I could
> add you as editor for hbase pages?)

No, I missed it but I see it now :) I updated the ticket for posterity;
here is my wiki info, thanks:

wiki Name = Joe Stein
alias = charmalloc
email = cryptcom@gmail.com

> St.Ack


Joe Stein
Twitter: @allthingshadoop <http://twitter.com/#!/allthingshadoop>
