hbase-user mailing list archives

From Suraj Varma <svarma...@gmail.com>
Subject Re: Hbase row key & MapReduce
Date Tue, 01 Mar 2011 17:45:33 GMT
Please check these threads:


On Tue, Mar 1, 2011 at 8:40 AM, Felix Sprick <fsprick@gmail.com> wrote:

> Hi everyone,
> I have a question regarding the design of the row key for an HBase table. I
> am working with a system storing hundreds of values up to 50 times per
> second over a period of several months. I want to run MapReduce jobs on this
> data performing simple calculations for each row within a certain period of
> time (usually hours but potentially also days and weeks). MapReduce because
> it would allow us to run this simple calculation in parallel in the cluster.
> How do I manage to have the data distributed over the HBase cluster so that
> the MapReduce calculation involves as many nodes as possible? If I use the
> timestamp as row-key I would end up with all data on one/few machines and
> run into hotspotting issues plus the MapReduce job would only run on a
> subset of all machines in the cluster. If I invert the timestamp and
> use this as the row-key I have the data distributed more evenly and
> MapReduce jobs could run on several machines. Problem then is that I
> wouldn't be able to restrict the input to the MapReduce scan with
> startRow/stopRow filters on the scan because rows belonging to one time
> frame wouldn't be stored sequentially any longer. Or is MapReduce designed
> in a way that I always have to walk through the entire database row by row?
> Any ideas?
> thanks,
> Felix
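
The trade-off described in the question can be illustrated with a small, HBase-free sketch. It is only a simulation in plain Python: it assumes that "invert the timestamp" means reversing the zero-padded decimal digits of a millisecond timestamp, and it models a startRow/stopRow scan as a lexicographic range over sorted keys. The function names, the sample rate and the time window are all made up for illustration and are not part of any HBase API.

```python
# Sketch: how many rows a single startRow/stopRow range must touch to cover
# one time window, for plain vs. digit-reversed timestamp row keys.
# Assumption: "inverting" means reversing the zero-padded decimal digits.

def plain_key(ts_ms: int) -> str:
    """Zero-padded timestamp; lexicographic order equals time order."""
    return f"{ts_ms:013d}"

def reversed_key(ts_ms: int) -> str:
    """Same digits reversed, so the fastest-changing digits lead the key."""
    return plain_key(ts_ms)[::-1]

# Hypothetical data: one sample every 20 ms (50 per second) for 10 seconds.
start = 1_298_999_000_000
samples = [start + 20 * i for i in range(500)]

# The 2-second time window the job is interested in.
lo, hi = start + 4_000, start + 6_000
wanted = [t for t in samples if lo <= t < hi]

def rows_in_key_range(key_fn) -> int:
    """Count rows inside the smallest key range that covers all wanted rows."""
    keys = sorted(key_fn(t) for t in samples)
    wanted_keys = [key_fn(t) for t in wanted]
    first, last = min(wanted_keys), max(wanted_keys)
    return sum(1 for k in keys if first <= k <= last)

plain_scanned = rows_in_key_range(plain_key)
reversed_scanned = rows_in_key_range(reversed_key)

print("rows wanted:              ", len(wanted))
print("scanned with plain keys:  ", plain_scanned)
print("scanned with reversed keys:", reversed_scanned)
```

With plain keys the range covers exactly the wanted rows, but those rows are adjacent in key order, which is what concentrates writes and map tasks on few regions. With reversed keys the wanted rows are spread across almost the whole key space, so one start/stop range ends up covering nearly every row.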
