hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Distinct counters and counting rows
Date Wed, 30 May 2012 22:39:25 GMT
I should add that getting an exact count at open time would be expensive
and probably not necessary.

On Wednesday, May 30, 2012, Andrew Purtell wrote:

> A common question about HBase is whether statistics on row index
> cardinality are maintained.
>
> The short answer is no. In some sense each HBase table region is its own
> database, and each region is partly in memory and partly (log structured)
> on disk, possibly including tombstones, so discovering the count of all
> unique keys in the full table requires that the client iterate over all
> rows in all regions. Only then might all live row keys be found.
>
> However, as others have mentioned, the coprocessor framework can help
> someone implement fast counting. When a region is first opened, all data
> is in HFiles, and each HFile knows the number of keys within it (though
> not unique keys at the moment). So a coprocessor could add new metadata (a
> unique row key count) to HFiles when writing them, at flush and compaction
> times, then load and sum those counts at region open time, and then
> maintain a probabilistic count at runtime using available blooms as new
> entries are stored into the Memstore*. The exact count would be available
> again upon the next open.
>
> *- Though offhand I'm not sure what to do about deletes.
>
> If someone does end up implementing something like this, please consider
> contributing it back, because it is discussed fairly often.
>     - Andy
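
The scheme described above (sum per-HFile counts at open, then track new keys against a bloom filter) could be modeled roughly as follows. This is a plain-Java sketch under stated assumptions, not HBase coprocessor code: the class `RegionRowCountSketch`, its methods, and the toy hash are all hypothetical, and a single in-memory bit set stands in for the HFile blooms. It ignores deletes, as the footnote does, and the summed starting count over-counts any row that appears in more than one HFile.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical model of the counting scheme; not an HBase API. */
final class RegionRowCountSketch {
    private static final int NUM_HASHES = 3;
    private final int numBits;
    private final BitSet bloom;
    private long rowCount;

    /** "Region open": start from the sum of the per-file unique row counts. */
    RegionRowCountSketch(int numBits, long[] perFileUniqueRowCounts) {
        this.numBits = numBits;
        this.bloom = new BitSet(numBits);
        for (long c : perFileUniqueRowCounts) {
            rowCount += c;
        }
    }

    /** Marks a key that is already in a store file (stand-in for that file's bloom). */
    void seedExistingKey(String rowKey) {
        for (int i = 0; i < NUM_HASHES; i++) {
            bloom.set(index(rowKey, i));
        }
    }

    /** Called for each write: count the key only if the bloom says "definitely new". */
    void onPut(String rowKey) {
        boolean maybeSeen = true;
        for (int i = 0; i < NUM_HASHES; i++) {
            int idx = index(rowKey, i);
            if (!bloom.get(idx)) {
                maybeSeen = false;
                bloom.set(idx);
            }
        }
        if (!maybeSeen) {
            rowCount++;
        }
    }

    long estimate() {
        return rowCount;
    }

    /** Toy seeded hash (FNV-style mixing), for illustration only. */
    private int index(String rowKey, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : rowKey.getBytes(StandardCharsets.UTF_8)) {
            h = (h ^ b) * 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }
}
```

Bloom false positives make `onPut` skip some genuinely new keys, so between opens the running figure can only drift low relative to the starting sum.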
> On Wednesday, May 30, 2012, Ramkrishna.S.Vasudevan wrote:
>> To answer this question:
>>
>> Alternatively, is there a way to trigger an increment in another table
>> (say "count") whenever a row was added to "user"?
>>
>> You can try to use coprocessors here. Once a put is done to the table
>> 'user', a coprocessor hook can trigger an Increment() operation on the
>> table 'count'. This can be done in one call from the client. Also, the
>> increment() operation guarantees atomicity.
>>
>> Hope this helps.
>> Regards
>> Ram
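
One detail worth spelling out about the hook-plus-increment idea: a put hook fires for every put, including a put to a user_id that already exists, so an unconditional increment would count puts rather than distinct rows. The following plain-Java model illustrates the distinction. It is a sketch under assumptions, not coprocessor code: `UserCountSketch` and its methods are hypothetical names, an in-memory set stands in for the 'user' table, and an `AtomicLong` stands in for the atomic counter cell in 'count'. How the "is this row new" check would be done inside HBase is left open here; the thread does not specify it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory model of "increment 'count' when a row is added to 'user'". */
final class UserCountSketch {
    // Stand-in for the 'user' table: one entry per row key.
    private final Set<String> userTable = ConcurrentHashMap.newKeySet();
    // Stand-in for the counter cell in the 'count' table; increments are atomic.
    private final AtomicLong countTable = new AtomicLong();

    /** Models a client put; the hook runs as part of the same call. */
    void put(String userId) {
        boolean isNewRow = userTable.add(userId);
        postPut(isNewRow);
    }

    /** Models the hook: only a put that created a row bumps the counter. */
    private void postPut(boolean isNewRow) {
        if (isNewRow) {
            countTable.incrementAndGet();
        }
    }

    long distinctUsers() {
        return countTable.get();
    }
}
```

Reading the user count is then a single counter lookup instead of a table scan, which is the "quick count" asked for in the original question.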
>> > -----Original Message-----
>> > From: David Koch [mailto:ogdude@googlemail.com]
>> > Sent: Wednesday, May 30, 2012 12:47 PM
>> > To: user@hbase.apache.org
>> > Subject: Distinct counters and counting rows
>> >
>> > Hello,
>> >
>> > I am testing HBase for distinct counters - more concretely, counting
>> > unique users from a fairly large stream of user_ids. For some time to
>> > come the volume will be limited enough to use exact counting rather
>> > than approximation but already it's too big to hold the entire set of
>> > user_ids in memory.
>> >
>> > For now I am basically inserting all elements from the stream into a
>> > "user" table which has "user_id" as its row key, so as to enforce the
>> > uniqueness constraint.
>> >
>> > My question:
>> > a) Is there a way to get a quick (i.e. with a small delay in a user
>> > interface) count of the size of the user table to return the number of
>> > users? Alternatively, is there a way to trigger an increment in
>> > another table (say "count") whenever a row was added to "user"? I
>> > guess this can be picked up eventually by the client application but I
>> > don't want this to delay the actual stream processing.
>> > b) I heard about Bloom filters in HBase but failed to understand whether
>> > they are used for row keys as well. Are they? How do I activate them? I
>> > am looking to reduce the workload of checking set membership for
>> > every user_id in the stream. If this is done by HBase internally, even
>> > better.
>> > c) Eventually, I want to store distinct users by day and then do
>> > unions across different days to get the total number of unique users for
>> > a multi-day period. Is this likely to involve a MapReduce job, or is
>> > there a more "light-weight" approach?
>> >
>> > Thank you,
>> >
>> > /David
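
On point (c) above, the reason a union is needed rather than a sum of daily counts is that a user seen on several days must be counted once. A minimal plain-Java illustration follows; `DailyDistinctUnion` is a hypothetical helper operating on in-memory sets, and it says nothing about how the per-day sets would be stored in or read from HBase.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: distinct users over several days via set union. */
final class DailyDistinctUnion {
    static int distinctAcross(List<Set<String>> dailyUserIds) {
        Set<String> union = new HashSet<>();
        for (Set<String> day : dailyUserIds) {
            union.addAll(day); // duplicates across days collapse here
        }
        return union.size();
    }
}
```

For example, with day one containing users a and b and day two containing b and c, the daily counts sum to 4 but the union holds 3 users.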

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
