hbase-user mailing list archives

From Ken Hampson <hamps...@gmail.com>
Subject Re: Table/column layout
Date Sat, 11 Jun 2016 03:16:12 GMT
I realize that was probably a bit of a wall of text... =)

So, TL;DR: I'm wondering:
1) If people have used and had good experiences with caller-specified
version timestamps (esp. given the caveats in the HBase book doc re: issues
with deletions and TTLs).

2) What people would suggest for optimal column naming when a very wide
table holds a potentially large number of different column groupings.

Thanks,
- Ken

On Tue, Jun 7, 2016 at 10:52 PM Ken Hampson <hampsonk@gmail.com> wrote:

> Hi:
>
> I'm currently using HBase 1.1.2 and am in the process of determining how
> best to proceed with the column layout for an upcoming expansion of our
> data pipeline.
>
> Background:
>
> Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is sha1
> Table B: billions of rows (more than Table A), 1.8 TB (with snappy
> compression), rowkey is sha1
>
>
> These tables represent data obtained via a combination batch/streaming
> process. We want to expand our data pipeline to run an assortment of
> analyses on these tables (both batch and streaming) and be able to store
> the results in each table as appropriate. Table A is a set of unique
> entries with some example data, whereas Table B is correlated to Table A
> (via Table A's sha1), but is not de-duplicated (that is to say, it contains
> contextual data).
>
> For the expansion of the data pipeline, we want to store the data in
> Table A if context is not needed, or in Table B if context is needed.
> We have a theoretically unlimited number of different analyses that we
> may want to perform and store the results for. That is to say, I need to
> assume there will be a substantial number of data sets stored in these
> tables, that the number will grow over time, and that each data set could
> itself be somewhat wide in terms of columns.
>
> Originally, I had considered storing these in column families, where each
> analysis is grouped together in a different column family. However, I have
> read in the HBase book documentation that HBase does not perform well with
> many column families (a few default, ~10 max), so I have discarded this
> option.
>
> The next two options both involve using wide tables with many columns in a
> separate column family (e.g. "d"), where all the various analyses would be
> grouped into the same family, with a potentially large number of columns
> in total. Each of these analyses needs to maintain its own version so we
> can correlate the data from each one. The variants which come to mind to
> accomplish that, and on which I would appreciate some feedback, are:
>
>    1. Use HBase's native versioning to store the version of the analysis
>    2. Encode a version in the column name itself
>
> I know the HBase native versions use the server's timestamp by default,
> but can take any long value. So we could assign a particular time value to
> be a version of a particular analysis. However, the doc also warned that
> there could be negative ramifications of this because HBase uses the
> versions internally for things like TTL for deletes/maintenance. Do people
> use versions in this way? Are the TTL issues of great concern? (We likely
> won't be deleting things often from the tables, but can't guarantee that we
> won't ever do so).
>
> Encoding a version in the column name itself would make the column names
> bigger, and I know it's encouraged for column names to be as small as
> possible.
>
> Adjacent to the native-version-or-not question, there's the general question
> of column naming. I was originally thinking maybe having a prefix followed by the
> column name, optionally with the version in the middle depending on whether
> 1 or 2 is chosen above. This would allow prefix filters to be used during
> gets/scans to gather all columns for a given analysis type, etc. but it
> would perhaps result in larger column names across billions of rows.
>
> e.g. *analysisfoo_4_column1*
>
> In practice, is this done and can it perform well? Or is it better to pick
> a fixed width and use some number in its place, that's then translated via,
> say, another table?
>
> e.g. *100000_1000_100000* (or something to that effect -- fixed width
> numbers that are stand-in ids for potentially longer descriptions).
>
> Thanks,
> - Ken
>
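
To make the two naming schemes from the quoted message concrete, here is a minimal sketch in plain Python (no HBase client involved). It compares the readable form (analysisfoo_4_column1) with a fixed-width variant. Note the assumptions: the fixed-width variant here packs binary integers rather than the decimal digits of the 100000_1000_100000 example, the 4/2/4-byte field widths are arbitrary, and the function names and ids are hypothetical. The id-to-name lookup that would live in "another table" is left out.

```python
import struct


def readable_qualifier(analysis: str, version: int, column: str) -> bytes:
    """Readable form from the message, e.g. b'analysisfoo_4_column1'."""
    return f"{analysis}_{version}_{column}".encode("utf-8")


def packed_qualifier(analysis_id: int, version: int, column_id: int) -> bytes:
    """Fixed-width form: 4-byte analysis id, 2-byte version, 4-byte column id.

    Big-endian unsigned fields (an assumption of this sketch) make the
    byte-wise sort order match the numeric order of each field.
    """
    return struct.pack(">IHI", analysis_id, version, column_id)


def packed_prefix(analysis_id: int, version=None) -> bytes:
    """Prefix matching every column of an analysis, or of one version of it."""
    if version is None:
        return struct.pack(">I", analysis_id)
    return struct.pack(">IH", analysis_id, version)


def unpack_qualifier(qualifier: bytes) -> tuple:
    """Inverse of packed_qualifier: (analysis_id, version, column_id)."""
    return struct.unpack(">IHI", qualifier)


if __name__ == "__main__":
    readable = readable_qualifier("analysisfoo", 4, "column1")
    packed = packed_qualifier(7, 4, 1)  # 7 and 1 are made-up stand-in ids
    print(len(readable), len(packed))
    print(packed.startswith(packed_prefix(7, 4)))
```

Two things stand out from the sketch. First, the qualifier is stored with every cell, so the per-cell difference in length is multiplied across billions of rows, at least before compression. Second, with the readable form a prefix such as analysisfoo_1 also matches analysisfoo_10_..., so a prefix filter (for instance HBase's ColumnPrefixFilter) would need the trailing delimiter included, and unpadded decimal versions sort as text (2 after 10). Fixed-width fields avoid both problems, at the cost of qualifiers that are no longer readable in the shell.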
