hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: On the number of column families
Date Sat, 14 Jul 2018 00:21:50 GMT
I think flushes are still done by region in all versions, so this can lead
to a lot of file IO depending on how well compaction can keep up. The CF is
the unit of IO scheduling granularity. For a single-row query that doesn't
select just a subset of CFs, each CF adds IO demand, with attendant impact.
The flip side is that if you segregate separately accessed subsets of data
into a CF per subset, and use queries with high CF selectivity, then you
optimize IO for your query. This kind of "manual" query planning is an
intended benefit (and burden) of the bigtable data model.
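
To make that concrete, here is a minimal sketch with the plain HBase Java
client of how CF selectivity shows up in a read. The table name, row key,
and family names below are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CfSelectiveGet {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
          // Unrestricted single-row get: every CF in the row is read,
          // so each store (one per CF) contributes file IO.
          Result whole = table.get(new Get(Bytes.toBytes("row1")));

          // CF-selective get: only the "meta" store is read, so the
          // other families add no IO demand to this query.
          Get narrow = new Get(Bytes.toBytes("row1"));
          narrow.addFamily(Bytes.toBytes("meta"));
          Result subset = table.get(narrow);
        }
      }
    }

The second form is the "manual" query planning referred to above: the
schema designer, not the engine, decides which stores a query touches.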

Because HBase currently holds open a reference to every file in a store,
there is a modest linear increase in heap demand as the number of CFs
grows. HDFS does a good job of multiplexing the notion of an open file over
a smaller set of OS-level resources. Other filesystem implementations (like
the S3 family) do not, so if you have a root FS on S3, then as the
aggregate number of files goes up, so does resource demand at the OS layer,
and you might have issues hitting open file descriptor limits. There are
some JIRAs open that propose changes to this. (I filed them.)
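
As a rough way to watch that file count grow, this sketch walks the
standard /hbase/data/<namespace>/<table>/<region>/<cf> layout and counts
store files per table. The path is illustrative, and housekeeping
directories (.tmp, recovered.edits, etc.) are not filtered out, so treat
the number as an estimate:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StoreFileCount {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Layout: /hbase/data/<namespace>/<table>/<region>/<cf>/<hfile>
        Path table = new Path("/hbase/data/default/t1");
        long files = 0;
        for (FileStatus region : fs.listStatus(table)) {
          if (!region.isDirectory()) continue;
          for (FileStatus cf : fs.listStatus(region.getPath())) {
            if (!cf.isDirectory()) continue;
            // Each store file here is a reference the region server
            // holds open, and on S3-style filesystems an OS-level
            // resource as well.
            files += fs.listStatus(cf.getPath()).length;
          }
        }
        System.out.println("store files: " + files);
      }
    }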

If you use Phoenix, as we do, and you turn on Phoenix's column encoding
feature, PHOENIX-1598 (
https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data),
then no matter how many logical columns you have in your schema, they are
mapped to a single CF at the HBase layer, which produces some space and
query-time benefits (and has some tradeoffs). So where I work the ideal is
one CF, although because we have legacy tables it is not universally
applied.
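
For anyone curious what turning that on looks like, here is a minimal
sketch through Phoenix's JDBC driver. The quorum in the URL, the table, and
the encoding width are all made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PhoenixEncodedTable {
      public static void main(String[] args) throws Exception {
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk1:2181");
             Statement stmt = conn.createStatement()) {
          // COLUMN_ENCODED_BYTES enables the PHOENIX-1598 column mapping:
          // logical columns get compact encoded qualifiers, all in one CF.
          stmt.execute(
              "CREATE TABLE metrics (" +
              "  id BIGINT NOT NULL PRIMARY KEY," +
              "  c1 VARCHAR, c2 VARCHAR, c3 VARCHAR" +
              ") COLUMN_ENCODED_BYTES = 2");
          conn.commit();
        }
      }
    }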


On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <lars.francke@gmail.com> wrote:

> I've got a question on the number of column families. I've told everyone
> for years that you shouldn't use more than maybe 3-10 column families.
>
> Our book still says the following:
> "HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
> Currently, *flushing* and compactions are done on a per Region basis so if
> one column family is carrying the bulk of the data bringing on flushes, the
> adjacent families will also be flushed even though the amount of data they
> carry is small."
>
> I'm wondering what the state of the art _really_ is today.
>
> I know that flushing happens per CF. As far as I can tell, though,
> compactions still happen for all stores in a region after a flush.
>
> Related question there (there's always a good chance that I misread the
> code): Wouldn't it make sense to make the compaction decision after a flush
> also per Store?
>
> But back to the original question. How many column families do you see
> and/or use in production? And what are the remaining reasons against "a
> lot"?
>
> My list is the following:
> - Splits happen per region, so small CFs will be split to be even smaller
> - Each CF takes up some resources even if it is not in use (no reads or
> writes)
> - If every CF is in use, then there is increased total memory pressure,
> which will probably lead to early flushes, which lead to smaller files,
> which lead to more compactions, etc.
> - As far as I can tell (but I'm not sure), when a single Store/CF answers
> "yes" to the needsCompaction() call after a flush, the whole region will
> be compacted
> - Each CF creates a directory + files per region -> might lead to lots of
> small files
>
> I'd love to update the book when I have some answers.
>
> Thank you!
>
> Cheers,
> Lars
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
