hbase-user mailing list archives

From Neil Yalowitz <neilyalow...@gmail.com>
Subject bulk import and counting increments
Date Thu, 12 Jan 2012 19:32:24 GMT
Hi all,

When performing a bulk import into HBase, what methods are available to
increment a counter?  To describe the problem: a large dataset comes in,
and the most efficient way to get that data into an HBase table is to bulk
load, as described here:


The stumbling block arises when a counter needs to be maintained that
relates to the imported data.  For our use case, each row of the input file
is a user log hit, but we need to maintain a counter of how many hits we
have accrued for each individual user so a separate job can take action if
the "hits" exceed a certain threshold.
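
To make the requirement concrete, here is a minimal in-memory sketch of the
counting logic, not HBase code. The class name HitCounter, the
tab-separated line format with the user id in the first field, and the
threshold value are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: counts log hits per user in memory.
class HitCounter {

    // Assumption: each log line starts with the user id, followed by a tab.
    static Map<String, Long> countHits(List<String> logLines) {
        Map<String, Long> hits = new HashMap<>();
        for (String line : logLines) {
            String userId = line.split("\t", 2)[0];
            hits.merge(userId, 1L, Long::sum); // add one hit for this user
        }
        return hits;
    }

    // Returns, in sorted order, the users whose hit count is strictly
    // greater than the threshold.
    static List<String> usersOverThreshold(Map<String, Long> hits, long threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : hits.entrySet()) {
            if (entry.getValue() > threshold) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

In the real system the counts have to live in the table and survive across
imports, which is where the difficulty comes from.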

Our current implementation does not use bulk import for this reason.
Instead, it uses HTable.put() with batched flushes, followed by an
incrementColumnValue() call for each hit, which is very slow.
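
Roughly, the write pattern looks like the sketch below. FakeTable is a
hypothetical in-memory stand-in, not the HBase client API; it only mirrors
the shape of the calls (a buffered put, a periodic flush, and one increment
per hit) so the number of increment calls can be seen. The row key format
and batch size are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a table client; NOT the HBase API.
class FakeTable {
    final List<String> writeBuffer = new ArrayList<>();
    final Map<String, Long> counters = new HashMap<>();
    int flushes = 0;
    int incrementCalls = 0;

    // Buffers the row on the client side until flush() is called.
    void put(String rowKey) {
        writeBuffer.add(rowKey);
    }

    // Sends the buffered rows in one batch (simulated by clearing the buffer).
    void flush() {
        if (!writeBuffer.isEmpty()) {
            writeBuffer.clear();
            flushes++;
        }
    }

    // Each call stands for one round trip to the server.
    long increment(String userId, long amount) {
        incrementCalls++;
        return counters.merge(userId, amount, Long::sum);
    }
}

class PerHitLoader {
    // One buffered put and one increment per hit; flush every batchSize puts.
    static void load(FakeTable table, List<String> userIds, int batchSize) {
        int pending = 0;
        for (int i = 0; i < userIds.size(); i++) {
            String userId = userIds.get(i);
            table.put(userId + "#" + i);   // assumed row key: user id plus sequence
            table.increment(userId, 1L);   // not batched: one call per hit
            if (++pending == batchSize) {
                table.flush();
                pending = 0;
            }
        }
        table.flush(); // send any remaining buffered puts
    }
}
```

The puts are amortized by the batching, but the increments are not: the
number of increment calls equals the number of hits.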

An alternative idea was to bulk import the data and use the version count
as a makeshift increment, but the follow-up job of "find rows where versions
> 3" would result in a full table scan, since there is no way to filter a
scan on "number of versions > x" (as far as I know).

Any ideas?  What techniques are other users utilizing to solve this problem?
