hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Parent/child relation - go vertical, horizontal, or many tables?
Date Fri, 11 Feb 2011 20:59:20 GMT

Jonathan,
Thanks for the response.
> The fact that a row cannot cross a region boundary is a consideration,
> but unless your rows will be many gigabytes each, I don't think this is
> that important.  Having to cross a region boundary to fulfill the "get
> all children" query would be my primary worry.

That would be an issue if you have a tall table with many rows. Assuming you had enough children
to break the wide row and the children were relatively big...

> Now besides those considerations above, the other two queries you want
> (parent-child point lookups and parent-child additions) are virtually
> identical in performance on the server-side starting with HBase 0.90 and
> beyond.  We have the same block-seeking optimizations in both schemas for
> the read case, and the write case is identical in both.

This is interesting.

So essentially the pat response these days is either "... it depends..." or "YMMV". 

Because the OP didn't really say how wide or how frequent he would have wide rows... I'd still
lean to wide rows... 
But it is good to know about the improvements in 0.90

Thx

-Mike


> From: jgray@fb.com
> To: user@hbase.apache.org
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> Date: Fri, 11 Feb 2011 20:48:51 +0000
> 
> Just to chime in with my usual take on this (seems like the tall vs. wide discussion
> happens every few weeks...)
>
> For "get all children of a parent", doing a get() on the wide table vs. doing a scan()
> on the tall table (as long as you set scanner caching appropriately) will be almost identical.
> I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT*
> that today a Scan will always require more than one RPC because the API is such that you need
> to open the scanner first, and then do next() on it, and then close() it.  This is a current
> API limitation but we could implement an optimization to allow for single-RPC scans if the
> query can be fulfilled in a single response (start row, stop row, and scanner caching set
> appropriately).  A Get, on the server-side, does this exact same thing but in a single RPC
> (it opens a scanner, next() on it, and then close() it).
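A rough way to picture the RPC counts described above, as a toy Python model rather than HBase client code. The function names are made up for illustration, and the model assumes each next() round trip returns up to `caching` rows; a real client may issue an extra next() to detect the end of the scan.

```python
import math


def get_rpcs() -> int:
    """A Get is one round trip: open, next, and close all happen server-side."""
    return 1


def scan_rpcs(num_rows: int, caching: int) -> int:
    """Toy model: one RPC to open, one per batch of `caching` rows, one to close."""
    if caching < 1:
        raise ValueError("caching must be >= 1")
    # At least one next() call is made even when the range turns out to be empty.
    next_calls = max(1, math.ceil(num_rows / caching))
    return 1 + next_calls + 1
```

Under this model, `scan_rpcs(1000, 1000)` is 3 while `get_rpcs()` is 1, which is the fixed overhead the paragraph refers to; with caching left at 1 the same scan would need 1002 round trips.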
> 
> The fact that a row cannot cross a region boundary is a consideration, but unless your
> rows will be many gigabytes each, I don't think this is that important.  Having to cross a
> region boundary to fulfill the "get all children" query would be my primary worry.
>
> Now besides those considerations above, the other two queries you want (parent-child
> point lookups and parent-child additions) are virtually identical in performance on the server-side
> starting with HBase 0.90 and beyond.  We have the same block-seeking optimizations in both
> schemas for the read case, and the write case is identical in both.
>
> The only other thing to consider is what if all the children of one parent can't fit
> in memory at the same time.  This is not at all related to a region getting too big (there
> is no requirement of fitting a region into memory) but is a consideration for reading it
> in a single RPC (both on the server-side and also receiving it in your client).  However,
> you would deal with this the same way in the tall or wide case.  In the tall case, you would
> appropriately set the scanner caching number.  In the wide case, you would set the intra-row
> scan limit.  In this case, you will be forced to use the Scan API regardless because if you
> need multiple RPCs for a single row, you need the Scanner next() semantics.
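A minimal sketch of the batching idea above: whether the children are rows (tall) or columns of one row (wide), the client reads them in bounded chunks instead of all at once. `iter_batches` is a hypothetical helper, not an HBase API; `limit` stands in for scanner caching in the tall case and for the intra-row scan limit in the wide case.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], limit: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `limit` items, holding only one batch at a time."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, limit))
        if not batch:
            return
        yield batch
```

For example, ten children read with a limit of 4 arrive as batches of 4, 4, and 2, so client memory is bounded by the limit rather than by the number of children.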
> 
> Many times, this decision comes down to a matter of personal preference.  I lean towards
> wide tables these days unless I expect extremely high numbers of children (so I want to split
> across regions and RPC requests) and I expect to frequently run the get-all-children query
> with high numbers of children.
> 
> JG
> 
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > Sent: Friday, February 11, 2011 12:23 PM
> > To: user@hbase.apache.org
> > Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> > 
> > 
> > David,
> > 
> > First a caveat... You need to have a realistic notion of the data and its sizes
> > when considering your options...
> > With respect to the response, here's what I said:
> > -=-
> > "With respect to your issue about a row being too large to fit in to memory...
> >  This would imply that the row would be too large to fit in to a single region.
> > Wouldn't that cause your HBase to die a horrible death?
> > 
> >  If this really is a potential situation, then you should consider the
> > parent_key, child_id compound row key..."
> > -=-
> > Now a correction. If you insert a row that is larger than a region, the region
> > will grow to fit the row and will not split. So until your row exceeds the size
> > of available disk... you can do it. So yeah you could fill up memory...
> > 
> > And that's the only reason why I would recommend option 2 over option 1.
> > So how real is this scenario?
> > 
> > Looking at the 3 stated use cases...  Doing a get() on the parent ID will give
> > you the entire set of children for the parent in a single fetch.
> > If you limit the columns to either a single column or a set of columns, you are
> > still going to be a single get().
> > 
> > This is going to be faster than doing a scan() on a series of rows starting with
> > parent_id and stopping with parent_id+1.
> > (At least in theory. I haven't mocked this out and tried it.)
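A minimal sketch of the "start at parent_id, stop at parent_id+1" range scan mentioned above, modeled on a sorted in-memory list rather than an HBase table. The tuple row key and the `scan_children` name are assumptions for illustration; real HBase row keys are byte arrays compared lexicographically.

```python
from bisect import bisect_left
from typing import List, Tuple

RowKey = Tuple[int, str]  # (parent_id, child_id), ordered by parent then child


def scan_children(sorted_keys: List[RowKey], parent_id: int) -> List[RowKey]:
    """Return every key in the half-open range [(parent_id, ""), (parent_id + 1, ""))."""
    # The empty string sorts before any child_id, so it marks the start of each parent.
    start = bisect_left(sorted_keys, (parent_id, ""))
    stop = bisect_left(sorted_keys, (parent_id + 1, ""))
    return sorted_keys[start:stop]
```

Because the keys are sorted, all children of one parent are contiguous, so the scan touches only that slice; a parent with no children yields an empty list.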
> > 
> > Again the only advantage of option 2 is if you really are worried about your
> > data size blowing you out of the water.
> > If you do find yourself using a lot of memory to fetch your edge cases, then
> > you'd be better off with the second option.
> > 
> > Here you have the following:
> > 
> > 1) Fetching all of the children (scan() with a start and stop key)
> > 2) Fetching some of the rows... (scan() with a start and stop key and some
> > sort of filter);
> > 3) Fetching single child (get() using a combination of parent_id, child_id for
> > the key.)
> > 
> > So while you don't have to worry about the size of a row, you do not get the
> > same performance that you could with option 1.
> > 
> > Does that make sense?
> > 
> > -Mike
> > 
> > 
> > 
> > 
> > 
> > > From: buttler1@llnl.gov
> > > To: user@hbase.apache.org
> > > Date: Fri, 11 Feb 2011 10:45:14 -0800
> > > Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> > >
> > > Michael,
> > > Thanks for the analysis.  The thought process you put into this seems
> > > useful.  However, following along at home I came to a different conclusion
> > > than you did.  I would prefer (sol. 2) over (sol. 3) for the reason you mention,
> > > but I would also strongly prefer (sol. 2) over (sol. 1), also for the reason you
> > > mention.
> > >
> > > So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1)
> > > would be very wasteful for use cases (u2) and (u3). The only time it would
> > > help is in (u1).  And then it doesn't seem obvious to me that a single row is
> > > better except in cases where there are very few children per parent.
> > >
> > > Perhaps if the data is expected to have a power law distribution (fat tail,
> > > zipfian), a hybrid approach would be better: go with (sol. 1) for any parent
> > > that has fewer than (say 10) children.  But, after a parent fills up its first 10
> > > children, start populating rows like (sol. 2).
> > >
> > > This would definitely make the client code more complex, so it would only
> > > make sense if there were huge savings to be had.
> > > Maybe a slightly better implementation of the hybrid would be to divide
> > > the child key space up into buckets so that you can directly address any child,
> > > but still have fewer calls in retrieving all children.  Then you can adjust your
> > > bucket size based on your actual use case (with a bucket size of 1 being the
> > > special case of (sol. 2)).
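One way to picture the bucketed hybrid described above, as an in-memory Python sketch. It assumes child ids are non-negative integers and that a bucket is `child_id // bucket_size`; the class and function names are hypothetical, and nothing here is HBase client code.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def row_key(parent_id: int, child_id: int, bucket_size: int) -> Tuple[int, int]:
    """Row key is (parent_id, bucket); the child_id itself becomes the column name."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be >= 1")
    return (parent_id, child_id // bucket_size)


class BucketedTable:
    def __init__(self, bucket_size: int) -> None:
        self.bucket_size = bucket_size
        self.rows: Dict[Tuple[int, int], Dict[int, str]] = defaultdict(dict)

    def put(self, parent_id: int, child_id: int, value: str) -> None:
        self.rows[row_key(parent_id, child_id, self.bucket_size)][child_id] = value

    def get_child(self, parent_id: int, child_id: int) -> Optional[str]:
        # The row is computable from the ids alone, so any child is directly addressable.
        key = row_key(parent_id, child_id, self.bucket_size)
        return self.rows.get(key, {}).get(child_id)

    def rows_for_parent(self, parent_id: int) -> int:
        """Number of rows a get-all-children query would have to read."""
        return sum(1 for (p, _bucket) in self.rows if p == parent_id)
```

With 25 children (ids 0 to 24) and a bucket size of 10, one parent occupies 3 rows; with a bucket size of 1 it occupies 25 rows, which is the (sol. 2) layout.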
> > >
> > > But the more I think about it, the more I suspect that the added complexity
> > > will not be worth it, and he should just go with (sol. 2).
> > >
> > > Dave
> > >
> > >
> > > -----Original Message-----
> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > > Sent: Friday, February 11, 2011 5:51 AM
> > > To: user@hbase.apache.org
> > > Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> > >
> > >
> > > Jason,
> > >
> > > You have the following constraint:
> > > For each child there is one parent. A parent can have more than one child.
> > >
> > > While you don't specify the size of the child, when a parent can have tens of
> > > millions, that could become an issue.
> > > Assuming that the child is relatively small...
> > >
> > > You have 3 use cases: (Scan patterns)
> > >
> > > > - Fetch all children from a single parent
> > > > - Find a few children by their keys or values from a single parent
> > > > - Update a single child by child key and its parent key
> > >
> > > Your options...
> > >
> > > > 1. One table with one Parent per row. Row key is a parent id.
> > > > Children are stored in a single family each under separate qualifier
> > > > (child id). Would it even work assuming all children may not fit in
> > > > memory?
> > > >
> > > While you raise an interesting point, let's look at the schema as a solution.
> > > This works well because you can fetch the entire row based on parent key.
> > > So all queries are get()s and not scan()s.
> > >
> > > You can then pull all of the existing columns where each column represents
> > > a child.
> > >
> > > You can also do a get() of only those columns you want based on child_id as
> > > the column name.
> > >
> > > You can also do a get() or a put() of a specific column (child_id) for a given
> > > parent (row key).
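A minimal in-memory sketch of the wide layout described above, with one row per parent and one column per child. `WideTable` and its methods are hypothetical stand-ins that only mirror the shape of the three access patterns; they are not the HBase client API.

```python
from typing import Dict, Iterable, Optional


class WideTable:
    """Row key = parent_id; each child lives under its own qualifier (child_id)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def put(self, parent_id: str, child_id: str, value: bytes) -> None:
        """Add or update a single child of a parent."""
        self._rows.setdefault(parent_id, {})[child_id] = value

    def get(self, parent_id: str,
            child_ids: Optional[Iterable[str]] = None) -> Dict[str, bytes]:
        """Fetch the whole row, or only the requested child columns."""
        row = self._rows.get(parent_id, {})
        if child_ids is None:
            return dict(row)
        return {c: row[c] for c in child_ids if c in row}
```

Fetching all children, fetching a few by id, and updating one child each map to a single call keyed on the parent, which is the property this option trades against row size.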
> > >
> > >
> > > With respect to your issue about a row being too large to fit in to memory...
> > > This would imply that the row would be too large to fit in to a single region.
> > > Wouldn't that cause your HBase to die a horrible death?
> > >
> > > If this really is a potential situation, then you should consider the
> > > parent_key, child_id compound row key...
> > >
> > > > 2. One table. Compound row key parent id/child id. One child per row.
> > > >
> > > Based on your use cases, I wouldn't recommend this. While it is a valid
> > > schema, it is only 'optimal' for your 'Update a single child by child key and its
> > > parent key'.
> > >
> > > > 3. Many tables - one per parent. Row key is a child id.
> > > If you have a scenario where a parent has billions+ of children, this
> > > could be a valid choice; however, based on what you said (up to tens
> > > of millions) and given that the data set is unique and non-intersecting, you
> > > would be better off with a single table. (Too many tables is not a
> > > good thing in HBase.)
> > >
> > >
> > > HTH
> > >
> > > -Mike
> > >
> > >
> > > > Subject: Parent/child relation - go vertical, horizontal, or many tables?
> > > > From: urgisb@gmail.com
> > > > Date: Thu, 10 Feb 2011 16:55:00 -0800
> > > > To: user@hbase.apache.org
> > > >
> > > > Hi all,
> > > >
> > > > Let's say I have two entities, Parent and Child. There could be many
> > > > children in one parent (from hundreds to tens of millions). A child can
> > > > only belong to one Parent.
> > > >
> > > > Typical queries are:
> > > > - Fetch all children from a single parent
> > > > - Find a few children by their keys or values from a single parent
> > > > - Update a single child by child key and its parent key
> > > >
> > > > And there are no cross-parent queries.
> > > >
> > > > I am trying to figure out what is the better schema approach from a
> > > > performance/maintenance perspective:
> > > >
> > > > 1. One table with one Parent per row. Row key is a parent id. Children
> > > > are stored in a single family each under separate qualifier (child id). Would it
> > > > even work assuming all children may not fit in memory?
> > > >
> > > > 2. One table. Compound row key parent id/child id. One child per row.
> > > >
> > > > 3. Many tables - one per parent. Row key is a child id.
> > > >
> > > > Thanks!
> > >
> > 
 		 	   		  