hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: 1 table, 1 dense CF => N tables, 1 dense CF ?
Date Fri, 09 Jan 2015 20:12:42 GMT
Otis:
You can find examples of how these methods are used in Phoenix.
Namely:
phoenix-core//src/main/java/org/apache/hadoop/hbase/regionserver/IndexHalfStoreFileReaderGenerator.java
phoenix-core//src/main/java/org/apache/phoenix/coprocessor/UngroupedAggregateRegionObserver.java
phoenix-core//src/main/java/org/apache/phoenix/hbase/index/Indexer.java

FYI
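
For anyone following along: a RegionObserver like the ones referenced
above is attached per table through the table descriptor. A minimal
sketch against the 1.0 admin API (the observer class name and jar path
are placeholders, not real artifacts):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.Coprocessor;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class AttachObserver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
          TableName table = TableName.valueOf("metrics");
          HTableDescriptor desc = admin.getTableDescriptor(table);
          // Hypothetical observer class; the jar must be readable by every
          // region server (e.g. on HDFS).
          desc.addCoprocessor("com.example.ExpiringCompactionObserver",
              new Path("hdfs:///coprocessors/expire.jar"),
              Coprocessor.PRIORITY_USER, null);
          admin.disableTable(table);
          admin.modifyTable(table, desc);
          admin.enableTable(table);
        }
      }
    }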

On Fri, Jan 9, 2015 at 12:03 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> I haven't written against this API yet, so I don't know all these answers
> off the top of my head. The API you're interested in is the set of
> preCompact* methods in
>
> http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html
>
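For reference, those hooks look roughly like this in the 0.98/1.0
coprocessor API (paraphrased from the javadoc above; check the exact
signatures for your version):

    // Runs while a compaction rewrites store files; returning a wrapping
    // InternalScanner lets a coprocessor drop or rewrite cells in flight.
    InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
        Store store, InternalScanner scanner, ScanType scanType)
        throws IOException;

    // Variant that also exposes the CompactionRequest (e.g. isMajor()).
    InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
        Store store, InternalScanner scanner, ScanType scanType,
        CompactionRequest request) throws IOException;

    // Runs before the compaction scanner is created, over the raw store
    // file scanners.
    InternalScanner preCompactScannerOpen(
        ObserverContext<RegionCoprocessorEnvironment> c, Store store,
        List<? extends KeyValueScanner> scanners, ScanType scanType,
        long earliestPutTs, InternalScanner s) throws IOException;
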
> On Fri, Jan 9, 2015 at 6:35 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com>
> wrote:
>
> > Hi,
> >
> > What Nick suggests below about using Compaction Coprocessor sounds
> > potentially very useful for us.  Q below.
> >
> > On Wed, Jan 7, 2015 at 8:21 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >
> > > Not to dig too deep into ancient history, but Tsuna's comments are
> > > mostly still relevant today, except for...
> > >
> > > > You also generally end up with fewer, bigger regions, which is almost
> > > > always better.  This entails that your RS are writing more data to
> > > > fewer WALs, which leads to more sequential writes across the board.
> > > > You'll end up with fewer HLogs, which is also a good thing.
> > >
> > >
> > > HBase is one WAL per region server and has been for as long as I've
> > > paid attention. Unless I've missed something, the number of tables
> > > doesn't change this fixed number.
> > >
> > > > If you use HBase's client (which is most likely the case, as the only
> > > > other alternative is asynchbase), beware that you need to create one
> > > > HTable instance per table per thread in your application code.
> > >
> > >
> > > You can still write your client application this way, but the preferred
> > > idiom is to use a single Connection instance through which all these
> > > resources are shared across HTable instances. This pattern is reinforced
> > > in the new client API introduced in 1.0.
> > >
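A minimal sketch of that idiom against the 1.0 client API (the table
name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SharedConnectionExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // One heavyweight Connection per process; it owns the ZooKeeper
        // session, meta cache, and thread pools.
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
          // Table handles are lightweight and not thread-safe: fetch one
          // per thread / per unit of work from the shared Connection.
          try (Table table = connection.getTable(TableName.valueOf("metrics"))) {
            Result result = table.get(new Get(Bytes.toBytes("user42-row")));
          }
        }
      }
    }
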
> > > FYI, I think you can write a Compaction coprocessor that implements your
> > > data expiration policy through normal compaction operations, thereby
> > > removing the need for the (expensive?) scan + write-delete pattern
> > > entirely.
> > >
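A sketch of the idea against the 0.98/1.0 coprocessor API (the per-user
retention lookup is a stub; this illustrates the mechanism, not a
drop-in implementation):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.ScanType;
    import org.apache.hadoop.hbase.regionserver.Store;

    // Drops expired cells while a compaction rewrites the store files, so
    // no separate scan + write-delete pass is needed for expiration.
    public class ExpiringCompactionObserver extends BaseRegionObserver {

      @Override
      public InternalScanner preCompact(
          ObserverContext<RegionCoprocessorEnvironment> c, Store store,
          final InternalScanner scanner, ScanType scanType) throws IOException {
        return new InternalScanner() {
          @Override
          public boolean next(List<Cell> results) throws IOException {
            boolean more = scanner.next(results);
            filterExpired(results);
            return more;
          }

          @Override
          public boolean next(List<Cell> results, int limit) throws IOException {
            boolean more = scanner.next(results, limit);
            filterExpired(results);
            return more;
          }

          @Override
          public void close() throws IOException {
            scanner.close();
          }
        };
      }

      // Cells removed here never reach the compacted store file.
      private void filterExpired(List<Cell> cells) {
        for (Iterator<Cell> it = cells.iterator(); it.hasNext();) {
          if (it.next().getTimestamp() < cutoff()) {
            it.remove();
          }
        }
      }

      // Stub: a real implementation would derive the retention window
      // from the userId prefix of the row key.
      private long cutoff() {
        return System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
      }
    }
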
> >
> > We actually do 2 types of full scans:
> > 1) scan everything and delete rows older than N days, where N can be
> > different for different users
> > 2) scan everything and merge multiple rows into 1 row via HBaseHUT -
> > https://github.com/sematext/HBaseHUT
> >
> > 2) is more expensive than 1).
> > I'm wondering if we could use a Compaction Coprocessor for 2)?  HBaseHUT
> > needs to be able to grab N rows and merge them into 1, delete those N
> > rows, and just write that 1 new row.  This N could be several thousand
> > rows.  Could a Compaction Coprocessor really be used for that?
> >
> > Also, would that come into play during minor or major compactions, or
> > both?
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
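For what it's worth on the last question: the preCompact hooks fire for
both minor and major compactions, and the ScanType / CompactionRequest
arguments let an observer tell them apart, so a merge-on-major-only
policy is expressible. A rough shape, inside a BaseRegionObserver
subclass (MergingScanner is hypothetical, standing in for HBaseHUT-style
processing):

    // Only rewrite rows during major compactions, where the scanner sees
    // all cells for a row; minor compactions pass through untouched.
    @Override
    public InternalScanner preCompact(
        ObserverContext<RegionCoprocessorEnvironment> c, Store store,
        InternalScanner scanner, ScanType scanType, CompactionRequest request)
        throws IOException {
      if (request.isMajor()) {
        // Hypothetical wrapper that buffers one HBaseHUT row group and
        // emits a single merged row in its place.
        return new MergingScanner(scanner);
      }
      return scanner;
    }

One caveat: compaction output must stay sorted, so the merged row's key
has to sort where the originals did, and a minor compaction may only see
a subset of the files (and hence of the rows) you would want to merge.
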
> > >
> > > -n
> > >
> > > On Wed, Jan 7, 2015 at 9:27 AM, Otis Gospodnetic <otis.gospodnetic@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > It's been asked before, but I didn't find any *definite* answers, and
> > > > a lot of the answers I found are from a whiiiile back.
> > > >
> > > > e.g. Tsuna provided pretty convincing info here:
> > > >
> > > > http://search-hadoop.com/m/xAiiO8ttU2/%2522%2522I+generally+recommend+to+stick+to+a+single+table%2522&subj=Re+One+table+or+multiple+tables+
> > > >
> > > > ... but that is from 3 years ago.  Maybe things changed?
> > > >
> > > > Here's our use case:
> > > >
> > > > Data/table layout:
> > > > * HBase is used for storing metrics at different granularities (1 min,
> > > > 5 min, ... a total of 6 different granularities)
> > > > * It's a multi-tenant system
> > > > * Keys are carefully crafted and include userId + number, where this
> > > > number contains the time and the granularity
> > > > * Everything's in 1 table and 1 CF
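
As an aside, a composite key of that shape is typically assembled from
fixed-width fields so that scans over one user + granularity + time
range stay contiguous. A rough sketch (field widths and layout here are
assumptions, not Otis's actual schema):

    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical layout: [userId: 4 bytes][granularity: 1 byte][time bucket: 8 bytes].
    // Fixed-width big-endian fields keep rows for one (user, granularity)
    // pair adjacent and ordered by time, so a time-range read is one Scan.
    public final class MetricKey {
      public static byte[] of(int userId, byte granularity, long bucketMillis) {
        byte[] key = new byte[4 + 1 + 8];
        Bytes.putInt(key, 0, userId);
        key[4] = granularity;
        Bytes.putLong(key, 5, bucketMillis);
        return key;
      }
    }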
> > > >
> > > > Access:
> > > > * We only access 1 system at a time, for a specific time range, and
> > > > specific granularity
> > > > * We periodically scan ALL data and delete data older than N days,
> > > > where N varies from user to user
> > > > * We periodically scan ALL data and merge multiple rows (of the same
> > > > granularity) into 1
> > > >
> > > > Question:
> > > > Would there be any advantage in having 6 tables - one for each
> > > > granularity - instead of having everything in 1 table?
> > > > Assume each table would still have just 1 CF and the keys would
> > > > remain the same.
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > Solr & Elasticsearch Support * http://sematext.com/
> > > >
> > >
> >
>
