hbase-user mailing list archives

From Amit Sela <am...@infolinks.com>
Subject Re: HBase load distribution vs. scan efficiency
Date Sun, 19 Jan 2014 21:02:52 GMT
If you use bulk load to insert your data, you could use the date as the key
prefix and choose the rest of the key in a way that splits each day
evenly. You'll have X regions for every day, so 14X regions for the two weeks.
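A minimal sketch of the bucketed-key idea discussed in this thread, assuming a row-key layout of bucket prefix + date + record id. The bucket comes from a stable hash of the record id, so one day's writes spread across buckets, and reading a date window back means one scan range per bucket. All names here (NUM_BUCKETS, makeRowKey, scanRangesForWindow, the yyyyMMdd format) are illustrative, not HBase API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not HBase client code: build bucketed row keys
// and enumerate the per-bucket scan ranges for a date window.
public class BucketedKeys {
    static final int NUM_BUCKETS = 16; // e.g. one per expected region

    // Row key layout: <2-digit bucket><yyyyMMdd date><record id>
    static String makeRowKey(String date, String recordId) {
        int bucket = Math.abs(recordId.hashCode() % NUM_BUCKETS);
        return String.format("%02d%s%s", bucket, date, recordId);
    }

    // One [start, stop) key range per bucket covering the same date window;
    // each range would back one Scan in a real client.
    static List<String[]> scanRangesForWindow(String startDate, String stopDate) {
        List<String[]> ranges = new ArrayList<>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            ranges.add(new String[] {
                String.format("%02d%s", b, startDate),
                String.format("%02d%s", b, stopDate)
            });
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(makeRowKey("20140119", "AAPL"));
        for (String[] r : scanRangesForWindow("20140106", "20140120")) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

With bulk load, the same layout lets you presplit the table on the bucket prefixes so each day's data lands evenly across regions, which is the even-split property suggested above.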
On Jan 19, 2014 8:39 PM, "Bill Q" <bill.q.hdp@gmail.com> wrote:

> Hi,
> I am designing a schema to host a large volume of data on HBase. We
> collect daily trading data for some markets, and we run a moving window
> analysis to make predictions based on a two-week window.
> Since everybody is going to pull the latest two weeks' data every day, if we
> put the date in the lead position of the key, we will have some hot
> regions. So we can use a bucketing approach (date mod bucket number) to
> deal with this situation. However, if we have 200 buckets, we need to run
> 200 scans to extract all the data from the last two weeks.
> My questions are:
> 1. What happens when each scan returns its result? Will the scan results be
> sent to a sink-like place that collects and concatenates all of them?
> 2. Why might having 200 scans be a bad thing compared to having only 10
> scans?
> 3. Any suggestions to the design?
> Many thanks.
> Bill
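On question 1: HBase itself does not merge results from separate scans; each scan streams back to the client independently, so any "sink" that collects and orders them is client-side code. A hedged sketch of that merge, assuming each per-bucket scan yields rows already sorted by date within its bucket (as the bucket+date key layout guarantees); the iterators here stand in for what would be ResultScanners in a real client:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative client-side k-way merge of per-bucket scan streams into
// one date-ordered stream, using a priority queue keyed on the row's
// date portion.
public class MergeScans {
    static List<String> mergeByDate(List<Iterator<String>> scans) {
        // Queue entries: (next row from a stream, index of that stream).
        PriorityQueue<Map.Entry<String, Integer>> pq =
            new PriorityQueue<>(Map.Entry.comparingByKey());
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < scans.size(); i++) {
            if (scans.get(i).hasNext()) {
                pq.add(new AbstractMap.SimpleEntry<>(scans.get(i).next(), i));
            }
        }
        while (!pq.isEmpty()) {
            Map.Entry<String, Integer> top = pq.poll();
            merged.add(top.getKey());
            Iterator<String> src = scans.get(top.getValue());
            if (src.hasNext()) {
                pq.add(new AbstractMap.SimpleEntry<>(src.next(), top.getValue()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Iterator<String>> scans = Arrays.asList(
            Arrays.asList("20140106:a", "20140108:a").iterator(),
            Arrays.asList("20140107:b", "20140109:b").iterator());
        System.out.println(mergeByDate(scans));
    }
}
```

On question 2, the usual concern is per-scan overhead rather than correctness: each of the 200 scans carries its own RPC setup, region lookups, and client threads, so fewer, larger scans (e.g. 10 buckets matched to the number of regions) tend to cost less for the same data volume.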
