hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to limit columns returned by a single row in HBase
Date Sat, 19 Jul 2014 20:27:40 GMT
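You can write your own filter based on ColumnCountGetFilter, but without
overriding the filterAllRemaining() method. ColumnCountGetFilter overrides it
to end the whole scan once the limit is exceeded, which is not what you want.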

In the filterKeyValue() method, return NEXT_ROW once the count is bigger than
the limit.
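
To make the per-row cap concrete, here is a minimal sketch in plain Java. It is a model of the decision logic only, not an HBase Filter subclass: the ReturnCode enum is a local stand-in for the HBase return codes named above, the class name PerRowLimitSketch and the scan() driver are made up for illustration, and rows are modelled as lists of qualifier strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of a per-row column cap. Not the HBase API:
// ReturnCode below is a local stand-in, and scan() is a hypothetical driver.
public class PerRowLimitSketch {
    public enum ReturnCode { INCLUDE_AND_NEXT_COL, NEXT_ROW }

    private final int limit;
    private int count = 0;

    public PerRowLimitSketch(int limit) {
        this.limit = limit;
    }

    // Called once per cell: include it until the row has yielded `limit` cells,
    // then ask to move on to the next row.
    public ReturnCode filterKeyValue() {
        count++;
        return count > limit ? ReturnCode.NEXT_ROW : ReturnCode.INCLUDE_AND_NEXT_COL;
    }

    // Never ends the scan early; only the current row is cut short.
    public boolean filterAllRemaining() {
        return false;
    }

    // Called between rows so that each row gets a fresh count.
    public void reset() {
        count = 0;
    }

    // Hypothetical driver: applies the model to rows given as lists of qualifiers.
    public static List<List<String>> scan(List<List<String>> rows, int limit) {
        PerRowLimitSketch filter = new PerRowLimitSketch(limit);
        List<List<String>> result = new ArrayList<>();
        for (List<String> row : rows) {
            if (filter.filterAllRemaining()) {
                break;
            }
            List<String> kept = new ArrayList<>();
            for (String qualifier : row) {
                if (filter.filterKeyValue() == ReturnCode.NEXT_ROW) {
                    break;
                }
                kept.add(qualifier);
            }
            result.add(kept);
            filter.reset();
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("d:1341069600", "d:1341069500", "d:1341069400"),
            Arrays.asList("d:1341059600", "d:1341059500", "d:1341059400"));
        // With a limit of 2, each row keeps only its first two qualifiers.
        System.out.println(scan(rows, 2));
    }
}
```

Because filterAllRemaining() always returns false in this model, hitting the limit in one row does not stop later rows from being read.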

Your filter can remember the file prefix of the previous row. If the file
prefix of the current row is the same as that of the previous row, return
true from filterRowKey() to skip the row (filterRowKey() returns a boolean,
not a ReturnCode such as NEXT_ROW).
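
The prefix check could look roughly like the sketch below. Again this is plain Java that models the logic, not an HBase Filter subclass. Several things here are assumptions: the class name PrefixSkipSketch, the helper names, row keys as strings of the form file_day with the prefix ending at the last underscore, and the extra condition that a row is skipped only once the file has already produced the requested number of records (as asked in the original question).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of skipping rows by file prefix. Not the HBase API:
// all names here are hypothetical, and row keys are assumed to be "<file>_<day>".
public class PrefixSkipSketch {
    private final int limit;
    private String currentPrefix = null;
    private int includedForPrefix = 0;

    public PrefixSkipSketch(int limit) {
        this.limit = limit;
    }

    // Assumption: the file prefix is everything before the last underscore.
    public static String filePrefix(String rowKey) {
        int i = rowKey.lastIndexOf('_');
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Mirrors the boolean contract of filterRowKey(): true means "skip this row".
    public boolean filterRowKey(String rowKey) {
        String prefix = filePrefix(rowKey);
        if (!prefix.equals(currentPrefix)) {
            // First row of a new file: start counting again and read it.
            currentPrefix = prefix;
            includedForPrefix = 0;
            return false;
        }
        // Same file as the previous row: skip only if it already filled its quota.
        return includedForPrefix >= limit;
    }

    // Record how many cells the row that was just read contributed.
    public void cellsIncluded(int n) {
        includedForPrefix += n;
    }

    // Hypothetical driver: returns the row keys that would actually be read.
    public static List<String> rowsRead(String[] rowKeys, int[] cellsPerRow, int limit) {
        PrefixSkipSketch filter = new PrefixSkipSketch(limit);
        List<String> read = new ArrayList<>();
        for (int i = 0; i < rowKeys.length; i++) {
            if (filter.filterRowKey(rowKeys[i])) {
                continue;
            }
            int remaining = limit - filter.includedForPrefix;
            filter.cellsIncluded(Math.min(cellsPerRow[i], remaining));
            read.add(rowKeys[i]);
        }
        return read;
    }

    public static void main(String[] args) {
        String[] keys = {"fileA_20140614", "fileA_20140613", "fileB_20140614"};
        int[] cells = {3, 3, 3};
        System.out.println(rowsRead(keys, cells, 3));
    }
}
```

With a limit of 3 in this model, the second fileA row is skipped because the first one already supplied 3 cells, while the fileB row is read because its prefix differs.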

Cheers


On Sat, Jul 19, 2014 at 8:23 AM, SiMaYunRui <mylpis@hotmail.com> wrote:

> Hi experts,
>
>
>
> I have a wide, flat table. During a scan, how can I limit the columns
> returned per row, rather than across all rows (which is what
> ColumnCountGetFilter does)? I need to scan multiple rows at the same time
> and do the aggregation on the client side.
>
> For more background: I am designing an auditing tool, which records events
> of the form “(who) operates against (what) at (when)”. A typical search is:
> given a time range from "2014/6/14 13:45" to "2014/6/24 7:15", list all
> files (the "what" part, matched by prefix) that were operated on, in DESC
> order of (when).
>
> I have tens of millions of records per day, and keep them for 30 to 90
> days. So I am thinking about two designs: a) rowkey as (file name)_(reverse
> of when). The problem is that people want to use a starts-with search to
> match multiple files, so the scan has to go through all matching files,
> which could be a huge amount of data, and then the client has to re-order
> the results to display the top 500 records. It could be very slow;
>
> b) use a wide, flat table, with rowkey as (file_name)_(reverse of when,
> truncated to the day, for partitioning) and qualifier as (reverse of when).
> In my opinion, this design can leverage the fact that qualifiers are sorted,
> so it needs fewer lookups than #a. But I cannot put all operations on a
> single file in one row, because the total number might exceed several
> million.
>
> So I am thinking of grouping data into the following shape by using #b.
> Back to my original question: because I only need 500 records, if the row
> (file A)_(2014/06/14) contains more than that number, can I stop reading it
> and continue the scan with the next row? And if I already have enough from
> (file A)_(2014/06/14), can I skip (file A)_(2014/06/13) and continue with
> (file B)_(2014/06/14), which is a different file?
>
> Row: (file A)_(2014/06/14)
>    d:1341069600 value
>    d:1341069500 value
>    d:1341069400 value
>
> Row: (file A)_(2014/06/13)
>    d:1341059600 value
>    d:1341059500 value
>    d:1341059400 value
>
> Row: (file B)_(2014/06/14)
>    d:1341069700 value
>    d:1341069580 value
>    d:1341069401 value
>
>
> Sent from Windows Mail
