hbase-user mailing list archives

From Lars Francke <lars.fran...@gmail.com>
Subject Re: On the number of column families
Date Mon, 16 Jul 2018 08:07:38 GMT
Thanks Andrew for taking the time to answer in detail!

I have to admit that I didn't check the code for this one but I remember
these JIRAs:
https://issues.apache.org/jira/browse/HBASE-3149: "Make flush decisions per
column family"
https://issues.apache.org/jira/browse/HBASE-10201: "Port 'Make flush
decisions per column family' to trunk" (in 1.1 and 2.0)
So I assume that that's one thing that's been solved.
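
A toy sketch of the idea behind those two JIRAs, as far as it can be read from their titles and the general design: instead of flushing every store in a region, only the stores whose memstore is large enough get flushed, with a fallback to flushing everything when none qualifies. This is not HBase code. The function name, the threshold and the sizes are all made up for illustration, and the real selection logic should be checked against the version in use.

```python
def stores_to_flush(memstore_sizes, lower_bound):
    """Pick the column families to flush in one region (toy model).

    memstore_sizes: dict of CF name -> memstore size in bytes (hypothetical).
    lower_bound: hypothetical per-CF size threshold in bytes.
    """
    large = [cf for cf, size in memstore_sizes.items() if size >= lower_bound]
    # Assumption: when no single store is large enough, flush all of them,
    # which is the old per-region behaviour.
    return sorted(large) if large else sorted(memstore_sizes)


MB = 1024 * 1024
sizes = {"d": 120 * MB, "meta": 2 * MB, "idx": 1 * MB}
print(stores_to_flush(sizes, lower_bound=16 * MB))  # prints ['d']
```

With a per-region decision all three families above would be flushed, producing two tiny files for "meta" and "idx" that later need compacting.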

Good point about the open files, thanks! I didn't know about the differences
between "normal" HDFS and other Hadoop FileSystem implementations.
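
For a rough feel of how the file count grows, here is a back-of-the-envelope sketch. It assumes every column family has one store per region and that each store holds about the same number of files. The region and file counts are invented numbers, not measurements from any cluster.

```python
def estimated_store_files(regions, column_families, files_per_store):
    """Rough count of store files held open on one region server.

    One store exists per (region, column family) pair, so the total grows
    linearly with the number of column families.
    """
    return regions * column_families * files_per_store


# Hypothetical server: 200 regions, about 4 files per store.
for cfs in (1, 3, 10):
    print(cfs, estimated_store_files(200, cfs, 4))
```

On a filesystem that needs OS-level resources per open file, the jump from one to ten families in this model means ten times as many handles for the same number of regions.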

And thanks for the pointer to the Phoenix column encoding feature.


On Sat, Jul 14, 2018 at 2:21 AM, Andrew Purtell <apurtell@apache.org> wrote:

> I think flushes are still done by region in all versions, so this can lead
> to a lot of file IO depending on how well compaction can keep up. The CF is
> the unit of IO scheduling granularity. For a single row query where you
> don't select only a subset of CFs, then each CF adds IO demand with
> attendant impact. The flip side to this is if you segregate subsets of data
> that are separately accessed into a CF for each subset, and use queries
> with high CF selectivity, then this optimizes IO to your query. This kind
> of "manual" query planning is an intended benefit (and burden) of the
> bigtable data model.
>
> Because HBase currently holds open a reference to all files in a store,
> there is some modest linear increase in heap demand as the number of CFs
> grows. HDFS does a good job of multiplexing the notion of open file over a
> smaller set of OS level resources. Other filesystem implementations (like
> the S3 family) do not, so if you have a root FS on S3 then as the number of
> files in the aggregate goes up so does resource demand at the OS layer, and
> you might have issues with hitting open file descriptor limits. There are
> some JIRAs open that propose changes to this. (I filed them.)
>
> If you use Phoenix, like we do, then if you turn on Phoenix's column
> encoding feature, PHOENIX-1598 (
> https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data)
> then no matter how many logical columns you have in your schema they are
> mapped to a single CF at the HBase layer, which produces some space and
> query time benefits (and has some tradeoffs). So where I work the ideal is
> one CF, although because we have legacy tables it is not universally
> applied.
>
>
> On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <lars.francke@gmail.com>
> wrote:
>
> > I've got a question on the number of column families. I've told everyone
> > for years that you shouldn't use more than maybe 3-10 column families.
> >
> > Our book still says the following:
> > "HBase currently does not do well with anything above two or three column
> > families so keep the number of column families in your schema low.
> > Currently, *flushing* and compactions are done on a per Region basis so if
> > one column family is carrying the bulk of the data bringing on flushes, the
> > adjacent families will also be flushed even though the amount of data they
> > carry is small."
> >
> > I'm wondering what the state of the art _really_ is today.
> >
> > I know that flushing happens per CF. As far as I can tell though
> > compactions still happen for all stores in a region after a flush.
> >
> > Related question there (there's always a good chance that I misread the
> > code): Wouldn't it make sense to make the compaction decision after a flush
> > also per Store?
> >
> > But back to the original question. How many column families do you see
> > and/or use in production? And what are the remaining reasons against "a
> > lot"?
> >
> > My list is the following:
> > - Splits happen per region, so small CFs will be split to be even smaller
> > - Each CF takes up a few resources even if it is not in use (no reads or
> > writes)
> > - If each CF is used then there is an increased total memory pressure which
> > will probably lead to early flushes, which lead to smaller files, which
> > lead to more compactions etc.
> > - As far as I can tell (but I'm not sure) when a single Store/CF answers
> > "yes" to the "needsCompaction()" call after a flush the whole region will
> > be compacted
> > - Each CF creates a directory + files per region -> might lead to lots of
> > small files
> >
> > I'd love to update the book when I have some answers.
> >
> > Thank you!
> >
> > Cheers,
> > Lars
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>
