hbase-user mailing list archives

From anil gupta <anilgupt...@gmail.com>
Subject Re: Table/column layout
Date Sat, 11 Jun 2016 18:47:11 GMT
My 2 cents:

#1. The HBase version timestamp is purely used for storing and purging
historical data on the basis of TTL. If you build an app that repurposes
timestamps, you might run into issues, so be very careful with that.
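
To illustrate the TTL concern, here is a minimal sketch in Python. It is a
simplified model of an age-based TTL check, not HBase code; the function
name `is_expired` and the 30-day TTL are assumptions made up for the
example. It shows why a small caller-specified timestamp such as "version
4" looks extremely old to an age-based purge.

```python
import time


def is_expired(cell_ts_ms, ttl_seconds, now_ms):
    """Simplified model: a cell is expired once its age exceeds the TTL."""
    return now_ms - cell_ts_ms > ttl_seconds * 1000


now_ms = int(time.time() * 1000)
ttl_seconds = 30 * 24 * 3600  # hypothetical 30-day TTL on the column family

server_ts = now_ms        # default behaviour: timestamp assigned at write time
analysis_version_ts = 4   # caller-specified "version 4" stored as the timestamp

# A freshly written cell with a server timestamp is kept.
print(is_expired(server_ts, ttl_seconds, now_ms))
# A cell whose timestamp is 4 looks decades old, so it is eligible for purging.
print(is_expired(analysis_version_ts, ttl_seconds, now_ms))
```

If no TTL is set on the column family, this particular problem does not
arise, but the timestamp still drives version ordering and deletes.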

#2. HBase usually suggests keeping column names to around 5-6 chars because
HBase stores data as KVs, so the column name is repeated in every cell.
But it's hard to keep doing that in **real world apps**. When you use block
encoding/compression, the performance penalty of long column names is
reduced. For example, Apache Phoenix uses FAST_DIFF encoding by default
because its column names are not short.
Here is another blogpost that discusses the perf of encoding/compression:
I have been using user-friendly column names (more readable rather than
short abbreviations) and I still get decent performance in my apps.
(Obviously, YMMV. My apps are performing within our SLA.)
In prod, I have a table that has 1100+ columns, and the column names are
not short. Hence, I would recommend that you go ahead with your non-short
column naming. You might need to try out different encoding/compression
settings to see which gives you the best performance.
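
As a rough back-of-the-envelope check, the sketch below estimates the
unencoded size of a single cell for a short and a long column name. It
assumes the classic KeyValue layout without tags (4-byte key length,
4-byte value length, 2-byte row length, 1-byte family length, 8-byte
timestamp, 1-byte type). The 20-byte raw sha1 rowkey, the family "d", the
8-byte value, and the qualifier "af4c1" are assumptions for illustration.
Block encoding and compression will shrink the repeated parts, so treat
these numbers as an upper bound.

```python
def keyvalue_size(row, family, qualifier, value):
    """Approximate unencoded KeyValue size in bytes (no tags)."""
    fixed = 4 + 4 + 2 + 1 + 8 + 1  # lengths, timestamp and type fields
    return fixed + len(row) + len(family) + len(qualifier) + len(value)


row = bytes(20)    # assumed: raw 20-byte sha1 rowkey
family = b"d"
value = bytes(8)   # assumed: 8-byte value

short_q = b"af4c1"                  # hypothetical abbreviated name
long_q = b"analysisfoo_4_column1"   # readable name from the thread

short_size = keyvalue_size(row, family, short_q, value)
long_size = keyvalue_size(row, family, long_q, value)

print(short_size, long_size, long_size - short_size)
```

The per-cell difference is just the difference in qualifier length, which
is why encodings that avoid repeating the key between adjacent cells help
so much here.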

Anil Gupta

On Fri, Jun 10, 2016 at 8:16 PM, Ken Hampson <hampsonk@gmail.com> wrote:

> I realize that was probably a bit of a wall of text... =)
> So, TL;DR: I'm wondering:
> 1) If people have used and had good experiences with caller-specified
> version timestamps (esp. given the caveats in the HBase book doc re: issues
> with deletions and TTLs).
> 2) About suggestions for optimal column naming for potentially large
> numbers of different column groupings for very wide tables.
> Thanks,
> - Ken
> On Tue, Jun 7, 2016 at 10:52 PM Ken Hampson <hampsonk@gmail.com> wrote:
> > Hi:
> >
> > I'm currently using HBase 1.1.2 and am in the process of determining how
> > best to proceed with the column layout for an upcoming expansion of our
> > data pipeline.
> >
> > Background:
> >
> > Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is
> > sha1
> > Table B: billions of rows (more than Table A), 1.8 TB (with snappy
> > compression), rowkey is sha1
> >
> >
> > These tables represent data obtained via a combination batch/streaming
> > process. We want to expand our data pipeline to run an assortment of
> > analyses on these tables (both batch and streaming) and be able to store
> > the results in each table as appropriate. Table A is a set of unique
> > entries with some example data, whereas Table B is correlated to Table A
> > (via Table A's sha1), but is not de-duplicated (that is to say, it
> > contains contextual data).
> >
> > For the expansion of the data pipeline, we want to store the data either
> > in Table A if context is not needed, or in Table B if context is needed.
> > We have a theoretically unlimited number of different analyses that we
> > may want to perform and store the results for (that is to say, I need to
> > assume there will be a substantial number of data sets that need to be
> > stored in these tables, which will grow over time and could each
> > themselves potentially be somewhat wide in terms of columns).
> >
> > Originally, I had considered storing these in column families, where each
> > analysis is grouped together in a different column family. However, I
> > have read in the HBase book documentation that HBase does not perform
> > well with many column families (a few default, ~10 max), so I have
> > discarded this option.
> >
> > The next two options both involve using wide tables with many columns in
> > a separate column family (e.g. "d"), where all the various analyses would
> > be grouped into the same family in a potentially large number of columns
> > in total. Each of these analyses needs to maintain its own versions so we
> > can correlate the data from each one. The variants which come to mind to
> > accomplish that, and on which I would appreciate some feedback, are:
> >
> >    1. Use HBase's native versioning to store the version of the analysis
> >    2. Encode a version in the column name itself
> >
> > I know the HBase native versions use the server's timestamp by default,
> > but can take any long value. So we could assign a particular time value
> > to be a version of a particular analysis. However, the doc also warned
> > that there could be negative ramifications of this because HBase uses the
> > versions internally for things like TTL for deletes/maintenance. Do
> > people use versions in this way? Are the TTL issues of great concern? (We
> > likely won't be deleting things often from the tables, but can't
> > guarantee that we won't ever do so).
> >
> > Encoding a version in the column name itself would make the column names
> > bigger, and I know it's encouraged for column names to be as small as
> > possible.
> >
> > Adjacent to the native-version-or-not question, there's the general
> > column naming. I was originally thinking maybe having a prefix followed
> > by the column name, optionally with the version in the middle depending
> > on whether 1 or 2 is chosen above. This would allow prefix filters to be
> > used during gets/scans to gather all columns for a given analysis type,
> > etc. but it would perhaps result in larger column names across billions
> > of rows.
> >
> > e.g. *analysisfoo_4_column1*
> >
> > In practice, is this done and can it perform well? Or is it better to
> > pick a fixed width and use some number in its place, that's then
> > translated via, say, another table?
> >
> > e.g. *100000_1000_100000* (or something to that effect -- fixed width
> > numbers that are stand-in ids for potentially longer descriptions).
> >
> > Thanks,
> > - Ken
> >
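
The prefixed naming scheme from the quoted message (e.g.
analysisfoo_4_column1) can be sketched as follows. This is only an
illustration in plain Python: `make_qualifier`, `parse_qualifier`, and
`columns_for` are hypothetical helpers, and the prefix match mimics
client-side what a column prefix filter (such as HBase's
ColumnPrefixFilter) would do on the server. It assumes the analysis name
itself contains no underscore.

```python
def make_qualifier(analysis, version, column):
    """Build a qualifier like b'analysisfoo_4_column1'."""
    return f"{analysis}_{version}_{column}".encode("utf-8")


def parse_qualifier(qualifier):
    """Split a qualifier back into (analysis, version, column)."""
    # maxsplit=2 lets the column part contain underscores of its own.
    analysis, version, column = qualifier.decode("utf-8").split("_", 2)
    return analysis, int(version), column


def columns_for(qualifiers, analysis, version):
    """Select all columns of one analysis version by prefix."""
    # The trailing underscore keeps version 4 from matching version 40.
    prefix = f"{analysis}_{version}_".encode("utf-8")
    return [q for q in qualifiers if q.startswith(prefix)]


qualifiers = [
    make_qualifier("analysisfoo", 4, "column1"),
    make_qualifier("analysisfoo", 4, "column2"),
    make_qualifier("analysisfoo", 40, "column1"),
    make_qualifier("analysisbar", 4, "column1"),
]

selected = columns_for(qualifiers, "analysisfoo", 4)
print(selected)
print(parse_qualifier(selected[0]))
```

The fixed-width numeric variant would work the same way with the names
swapped for ids, at the cost of a lookup table to translate them back.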

Thanks & Regards,
Anil Gupta
