hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: When to expand vertically vs. horizontally in Hbase
Date Fri, 05 Jul 2013 17:48:18 GMT
Sorry, but you missed the point. 

(Note: This is why I keep trying to give a talk on schema design at Strata and the other
conferences, yet for some reason it just doesn't seem important enough or sexy enough. Maybe
if I worked for Cloudera/Intel/etc. ;-) )

Look, 

The issue is what column families are and how to use them.

Since each column family is stored in its own HFiles but uses the same row key, the question
is why you need one and when you want to use one.

The answer unfortunately is a bit more complicated than the questions. 

You have to ask yourself: when do you have a series of tables that all share the same key value?

How do you access this data? 

It gets more involved, but just looking at the answers to those two questions is a start.


Like I said, think about the order entry example and how the data is used in those column
families. 
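
A minimal sketch of that order entry layout, written as a toy in-memory model in plain Java
rather than against the HBase client API. The class name OrderTableSketch, the family names
(order, pick, ship, invoice), the qualifiers and the row key format are all assumptions made
up for illustration. The point it tries to show is that each family is its own store, every
store is keyed by the same row key, and reading one family does not touch the others.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a table whose column families share one row key. Not the HBase API. */
public class OrderTableSketch {
    // One sorted store per column family: family -> row key -> qualifier -> value.
    private final Map<String, TreeMap<String, TreeMap<String, String>>> stores =
            new LinkedHashMap<>();

    /** The set of families is fixed when the table is created. */
    public OrderTableSketch(String... families) {
        for (String family : families) {
            stores.put(family, new TreeMap<>());
        }
    }

    private TreeMap<String, TreeMap<String, String>> storeFor(String family) {
        TreeMap<String, TreeMap<String, String>> store = stores.get(family);
        if (store == null) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        return store;
    }

    public void put(String rowKey, String family, String qualifier, String value) {
        storeFor(family).computeIfAbsent(rowKey, k -> new TreeMap<>()).put(qualifier, value);
    }

    /** Reads one family of a row; only that family's store is consulted. */
    public Map<String, String> getFamily(String rowKey, String family) {
        TreeMap<String, String> cells = storeFor(family).get(rowKey);
        return cells == null ? new TreeMap<>() : new TreeMap<>(cells);
    }

    /** Reads the whole row; every family's store is consulted. */
    public Map<String, Map<String, String>> getRow(String rowKey) {
        Map<String, Map<String, String>> row = new LinkedHashMap<>();
        for (String family : stores.keySet()) {
            Map<String, String> cells = getFamily(rowKey, family);
            if (!cells.isEmpty()) {
                row.put(family, cells);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical families and row key, following the order entry example.
        OrderTableSketch table = new OrderTableSketch("order", "pick", "ship", "invoice");
        table.put("order-000123", "order", "customer", "ACME");
        table.put("order-000123", "pick", "bin", "A-17");
        table.put("order-000123", "ship", "carrier", "UPS");
        table.put("order-000123", "invoice", "total", "100.00");

        // The warehouse only needs the pick slip; billing only needs the invoice.
        System.out.println(table.getFamily("order-000123", "pick"));
        System.out.println(table.getRow("order-000123").keySet());
    }
}
```

Each step of the order's life reads and writes its own family under the one key. That access
pattern is what to look for before adding a family.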

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last
part, but it's a very important concept. You need to stop thinking in terms of ERDs when there
is no relationship. Column families tend to create a weak relationship, which makes them
a bit more confusing.

On Jul 5, 2013, at 11:16 AM, Aji Janis <aji1705@gmail.com> wrote:

> I understand that there shouldn't be unlimited number of column families. I
> am using this example on purpose to see how it comes into play.
> 
> 
> On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> 
>> Why do you have so many column families (CFs)?
>> 
>> It's not a question of physical limitations, but more one of
>> data design.
>> 
>> There aren't many really good examples where you would need
>> more than a handful of CFs.
>> 
>> When I teach or lecture, the example I use is an order entry system,
>> where you would have the same key on order entry, pick slips, shipping,
>> and invoice.
>> 
>> That's probably the best example of where CFs come into play.
>> 
>> I'd suggest that you go back and rethink the design if you have more
>> than a handful.
>> 
>> 
>> 
>> On Jul 5, 2013, at 8:53 AM, Aji Janis <aji1705@gmail.com> wrote:
>> 
>>> Asaf,
>>> 
>>> I am using the Genre/Author stuff as an example but yes, at the moment I
>>> only have 5 column families. However, over time I may have more (no upper
>>> limit decided at this point). See below for more responses.
>>> 
>>> 
>>> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:
>>> 
>>>> Do you have only 5 static author names?
>>>> Keep in mind the column family name is defined when creating the table.
>>>> 
>>>> Regarding the tall vs. wide debate:
>>>> HBase is first and foremost a key-value database, so reads and writes
>>>> happen at the column-value level, and it doesn't really care about rows.
>>>> But that's not entirely true. Rows come into play in the following
>>>> situations:
>>>> Splitting a region is per row and not per column, so a row is always
>>>> stored as a whole in one region. If you have a really large row, the
>>>> region size granularity depends on it. That doesn't seem to be the case
>>>> here.
>>>> Put/Delete takes a lock on the row until it finishes. If you do
>>>> intensive inserts to the same row at the same time, this might be bad
>>>> for you. Keeping your rows slimmer can reduce contention, but again, only
>>>> if you make a lot of concurrent modifications to the same row.
>>>> 
>>> 
>>> I expect batches of Put/Delete to the same row to come from at most one
>>> thread at a time, based on users' current behavior, so locking shouldn't
>>> be an issue. However, I'm not sure whether fitting a whole row into a
>>> region is something I need to worry about (probably because I just
>>> don't know much about it yet).
>>> 
>>> 
>>>> Filtering - if you need a filter that needs the whole row (there is a
>>>> method you override in Filter to mark that), then a fat row will be more
>>>> memory intensive. If you need only 1/5 of your row, then maybe splitting
>>>> it into 5 rows to begin with would make a better schema design in terms
>>>> of memory and I/O.
>>>> 
>>> 
>>> Currently, my access pattern is to get all data for a given row. It's
>>> possible that in the future we may want to apply (family/qualifier)
>>> filters. There is a lot of uncertainty about use cases (client side) at
>>> this point, which is why I am not entirely sure how things will look
>>> months from now. I am not sure I follow this statement:
>>> 
>>> "if you need a filter that needs the whole row (there is a method you
>>> override in Filter to mark that), then a fat row will be more memory
>>> intensive."
>>> 
>>> Can you please explain? Thank you for these suggestions btw, good food
>>> for thought!
>>> 
>>> 
>>>> 
>>>> On Wednesday, July 3, 2013, Aji Janis wrote:
>>>> 
>>>>> I have a major typo in the question so I apologize. I meant to say 5
>>>>> families with 1000+ qualifiers each.
>>>>> 
>>>>> Let's work with an example (not the greatest example here, but still).
>>>>> Let's say we have a Genre class like this:
>>>>> 
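>>>>> import java.util.ArrayList;
>>>>> 
>>>>> class Books {}  // placeholder for the book type, not defined in this example
>>>>> 
>>>>> class HistoryBooks {
>>>>>     // One list of books per author; each author maps to a column family.
>>>>>     ArrayList<Books> author1;
>>>>>     ArrayList<Books> author2;
>>>>>     ArrayList<Books> author3;
>>>>>     ArrayList<Books> author4;
>>>>>     ArrayList<Books> author5;
>>>>> 
>>>>>     // ...
>>>>> }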
>>>>> 
>>>>> Each author is a column family (let's say we only allow 5 authors per
>>>>> <T>Book class). Each book by an author ends up being a qualifier. In
>>>>> this case, I know I have a max family count, but my qualifiers have no
>>>>> upper limit. So is this scenario a case for a tall or a wide table?
>>>>> Why? Thank you.
>>>>> 
>>>>> 
>>>>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
>>>>> <bbeaudreault@hubspot.com> wrote:
>>>>> 
>>>>>> If they are accessed mostly together, they should all be in a single
>>>>>> column family. The choice between tall and wide depends on the total
>>>>>> byte size of each KeyValue. Your cells would need to be quite large
>>>>>> for 50 to become a problem. I still would recommend using a single CF
>>>>>> though.
>>>>>> —
>>>>>> Sent from iPhone
>>>> 
>> 
>> 

