hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: data structure
Date Thu, 14 Jul 2011 20:28:03 GMT
On Thu, Jul 14, 2011 at 12:52 PM, Andre Reiter <a.reiter@web.de> wrote:
> now we are running mapreduce jobs, to generate a report: for example we want
> to know how many impressions were done by all users in last x days.
> therefore the scan of the MR job is running over all data in our hbase table
> for the particular family. this takes at the moment about 70 seconds, which
> is actually a bit too long, and with the data growing, the time will
> increase, unless we add new workers to the cluster. we have right now 22
> regions
>

Why is 70 seconds too long for a report?  70 seconds seems like a
short mapreduce job (to me).

You don't have that many regions.

How fast would you like this operation to complete?

The report you describe above is predicated on looking at all data,
right?  If so, I'm not sure how you'd avoid the job taking longer the
more data you have (unless you up the parallelism and/or cluster size).

> the problem i see, is that we can not define a filter for the scan, the row
> key (user id) is just a UUID, nothing meaningful in it
>

What do you want to filter out?  When you scan, you narrow its scope
by setting a time range, column family, etc.
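
As a rough illustration of the time-range idea, here is a minimal
sketch that computes the bounds for "the last x days" using only the
Java standard library.  The class and method names (ReportWindow,
lastDays) are made up for this example.  It assumes the cell
timestamps reflect when the impression happened, which holds only if
you let HBase assign the write time or set it yourself.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: the [minStamp, maxStamp) window for the last N days. */
final class ReportWindow {
    final long minStamp; // inclusive lower bound, epoch millis
    final long maxStamp; // exclusive upper bound, epoch millis

    ReportWindow(long minStamp, long maxStamp) {
        this.minStamp = minStamp;
        this.maxStamp = maxStamp;
    }

    /** Window that ends at nowMillis and reaches back the given number of days. */
    static ReportWindow lastDays(int days, long nowMillis) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        return new ReportWindow(nowMillis - TimeUnit.DAYS.toMillis(days), nowMillis);
    }

    public static void main(String[] args) {
        ReportWindow w = lastDays(7, System.currentTimeMillis());
        // With the HBase client on the classpath, these bounds would be
        // handed to Scan.setTimeRange(w.minStamp, w.maxStamp), next to
        // Scan.addFamily(...) for the one family the report needs.
        System.out.println("minStamp=" + w.minStamp + " maxStamp=" + w.maxStamp);
    }
}
```

The scan still visits every row, since the row key carries no time
information, but it returns only cells inside the window.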


> what can we do to improve (accelerate) the scan process? is it maybe
> advisable to store the data more redundantly? so for example we create a
> second table and store every impression twice, one time with the user id as
> row key in the first table, and the second one with a time stamp as a row
> key in the second table.

You could do this to make a view that was more amenable to your report
generation.
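
One possible shape for the row key of such a second table, as a
sketch only: an 8-byte big-endian timestamp followed by the 16 bytes
of the user's UUID.  The layout and the names (TimeKey, rowKey,
boundary) are assumptions for illustration, not something from your
schema.  It relies on HBase ordering row keys as unsigned bytes and on
the start row of a scan being inclusive and the stop row exclusive.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical row-key layout for a time-ordered impressions table. */
final class TimeKey {
    static final int KEY_LENGTH = 24; // 8-byte timestamp + 16-byte UUID

    /** Timestamp first, so rows sort by time; UUID second, to keep keys unique per user. */
    static byte[] rowKey(long timestampMillis, UUID userId) {
        return ByteBuffer.allocate(KEY_LENGTH)   // ByteBuffer is big-endian by default
                .putLong(timestampMillis)        // assumes a non-negative epoch value
                .putLong(userId.getMostSignificantBits())
                .putLong(userId.getLeastSignificantBits())
                .array();
    }

    /** Start or stop row for a scan: just the 8-byte timestamp prefix. */
    static byte[] boundary(long timestampMillis) {
        return ByteBuffer.allocate(8).putLong(timestampMillis).array();
    }

    /** Lexicographic comparison on unsigned bytes, the order row keys are sorted in. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        byte[] key = rowKey(now, UUID.randomUUID());
        byte[] start = boundary(now - 7L * 24 * 60 * 60 * 1000); // seven days back
        byte[] stop = boundary(now + 1);
        boolean inRange = compareUnsigned(start, key) <= 0 && compareUnsigned(key, stop) < 0;
        System.out.println("key falls inside the 7-day scan range: " + inRange);
    }
}
```

A report over the last x days then becomes a scan between two
boundary keys instead of a scan over the whole table.  One thing to
weigh: keys that start with the current time all land at the end of
the table, so new writes concentrate on a single region.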


> the data volume would grow twice as fast, but our scans will work x times
> faster on the second table compared to now
>

You'd have to figure out what you can tolerate.  Writes get slower
because you are now writing to two places instead of one, but your
reports will run faster.

St.Ack
