hbase-user mailing list archives

From Ken Hampson <hamps...@gmail.com>
Subject Table/column layout
Date Wed, 08 Jun 2016 02:52:49 GMT

I'm currently using HBase 1.1.2 and am in the process of determining how
best to proceed with the column layout for an upcoming expansion of our
data pipeline.


Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is sha1
Table B: billions of rows (more than Table A), 1.8 TB (with snappy
compression), rowkey is sha1

These tables represent data obtained via a combination batch/streaming
process. We want to expand our data pipeline to run an assortment of
analyses on these tables (both batch and streaming) and be able to store
the results in each table as appropriate. Table A is a set of unique
entries with some example data, whereas Table B is correlated to Table A
(via Table A's sha1), but is not de-duplicated (that is to say, it contains
contextual data).

For the expansion of the data pipeline, we want to store the results in
Table A if context is not needed, and in Table B if it is. The number of
different analyses that we may want to perform and store results for is
theoretically unlimited, so I need to assume that a substantial number of
data sets will be stored in these tables, that this number will grow over
time, and that each data set could itself be somewhat wide in terms of
columns.

Originally, I had considered storing these in column families, where each
analysis is grouped together in a different column family. However, I have
read in the HBase book documentation that HBase does not perform well with
many column families (a few typically, ~10 max), so I have discarded this
option.

The next two options both involve using wide tables with many columns in a
separate column family (e.g. "d"), where all the various analyses would be
grouped into the same family, across a potentially large number of columns
in total. Each of these analyses needs to maintain its own versions so we
can correlate the data from each one. The variants that come to mind to
accomplish that, and on which I would appreciate some feedback, are:

   1. Use HBase's native versioning to store the version of the analysis
   2. Encode a version in the column name itself

I know the HBase native versions use the server's timestamp by default, but
can take any long value. So we could assign a particular time value to be a
version of a particular analysis. However, the doc also warned that there
could be negative ramifications of this because HBase uses the versions
internally for things like TTL for deletes/maintenance. Do people use
versions in this way? Are the TTL issues of great concern? (We likely won't
be deleting things often from the tables, but can't guarantee that we won't
ever do so).
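
To make the TTL concern concrete, here is a minimal sketch in Python (not
HBase client code). It assumes a simplified model in which a cell counts as
expired once its timestamp lags the server clock by more than the column
family's TTL. The names is_expired, NOW_MS and ONE_YEAR_S are hypothetical
and exist only for this illustration.

```python
def is_expired(cell_ts_ms, ttl_seconds, now_ms):
    """Simplified TTL model: expired once the cell timestamp is older than now - TTL."""
    return now_ms - cell_ts_ms > ttl_seconds * 1000

# Fixed "now" in milliseconds since the epoch, so the example is deterministic.
NOW_MS = 1_465_354_369_000
ONE_YEAR_S = 365 * 24 * 3600

# Analysis version 4 written directly as the cell timestamp:
print(is_expired(4, ONE_YEAR_S, NOW_MS))                # True

# A cell written one minute ago with a wall-clock timestamp:
print(is_expired(NOW_MS - 60_000, ONE_YEAR_S, NOW_MS))  # False
```

Under this model, small integers used as version numbers look like
timestamps from 1970, so any finite TTL on the family would expire them
right away. A family whose TTL is left unset would not be affected in this
particular way.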

Encoding a version in the column name itself would make the column names
bigger, and I know it's encouraged for column names to be as small as
possible.

Adjacent to the native-version-or-not question, there's the general column
naming. I was originally thinking maybe having a prefix followed by the
column name, optionally with the version in the middle depending on whether
1 or 2 is chosen above. This would allow prefix filters to be used during
gets/scans to gather all columns for a given analysis type, but it would
perhaps result in larger column names across billions of rows.

e.g. *analysisfoo_4_column1*
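
A minimal sketch of this naming scheme, in Python for brevity rather than
against the HBase client API. The helpers make_qualifier, parse_qualifier
and scan_prefix are hypothetical names, and the startswith check only
stands in for what a server-side filter such as ColumnPrefixFilter would
do. It assumes "_" never appears in an analysis name.

```python
SEP = "_"

def make_qualifier(analysis, version, column):
    """Build a qualifier of the form <analysis>_<version>_<column>."""
    if SEP in analysis:
        raise ValueError("analysis name must not contain the separator")
    return f"{analysis}{SEP}{version}{SEP}{column}".encode("utf-8")

def parse_qualifier(qualifier):
    """Split a qualifier back into (analysis, version, column)."""
    analysis, version, column = qualifier.decode("utf-8").split(SEP, 2)
    return analysis, int(version), column

def scan_prefix(analysis, version=None):
    """Prefix matching one analysis, optionally narrowed to one version."""
    prefix = analysis + SEP
    if version is not None:
        prefix += f"{version}{SEP}"
    return prefix.encode("utf-8")

# One row's qualifiers, modelled as a plain dict for illustration.
row = {
    make_qualifier("analysisfoo", 4, "column1"): b"a",
    make_qualifier("analysisfoo", 5, "column1"): b"b",
    make_qualifier("analysisfoobar", 4, "column1"): b"c",
}

wanted = scan_prefix("analysisfoo", 4)
matches = sorted(q for q in row if q.startswith(wanted))
print(matches)  # [b'analysisfoo_4_column1']
```

Ending the prefix with the separator matters: without it, a prefix of
"analysisfoo" would also match columns belonging to "analysisfoobar".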

In practice, is this done and can it perform well? Or is it better to pick
a fixed width and use some number in its place, that's then translated via,
say, another table?

e.g. *100000_1000_100000* (or something to that effect -- fixed width
numbers that are stand-in ids for potentially longer descriptions).
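
The fixed-width variant could be sketched as follows, again in Python and
under assumptions: the IDs are packed as big-endian binary integers rather
than decimal text, the two dicts stand in for the separate lookup table,
and the field widths (4-byte analysis ID, 2-byte version, 4-byte column ID)
are arbitrary choices for illustration. All names here are hypothetical.

```python
import struct

# Big-endian: unsigned 4-byte analysis id, 2-byte version, 4-byte column id.
_LAYOUT = struct.Struct(">IHI")

# Stand-ins for the lookup table that maps long names to numeric ids.
ANALYSIS_IDS = {"analysisfoo": 100000}
COLUMN_IDS = {"column1": 100000}

def pack_qualifier(analysis, version, column):
    """Encode (analysis, version, column) as a fixed-width 10-byte qualifier."""
    return _LAYOUT.pack(ANALYSIS_IDS[analysis], version, COLUMN_IDS[column])

def unpack_qualifier(qualifier):
    """Decode a qualifier back into (analysis_id, version, column_id)."""
    return _LAYOUT.unpack(qualifier)

def analysis_prefix(analysis):
    """First 4 bytes shared by every qualifier of one analysis."""
    return struct.pack(">I", ANALYSIS_IDS[analysis])

q = pack_qualifier("analysisfoo", 1000, "column1")
print(len(q))                         # 10
print(len(b"analysisfoo_4_column1"))  # 21
print(unpack_qualifier(q))            # (100000, 1000, 100000)
print(q.startswith(analysis_prefix("analysisfoo")))  # True
```

Because every field has a fixed width, a prefix match on the first 4 bytes
(or the first 6, to include the version) is still possible. The cost is
that qualifiers are no longer readable without the lookup table.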

- Ken
