hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Coprocessors
Date Thu, 25 Apr 2013 23:00:46 GMT
Thanks for the additional info, Sudarshan. This would fit well with the 
implementation of Phoenix's skip scan.

     CREATE TABLE t (
         object_id INTEGER NOT NULL,
         field_type INTEGER NOT NULL,
         attrib_id INTEGER NOT NULL,
         value BIGINT
         CONSTRAINT pk PRIMARY KEY (object_id, field_type, attrib_id));

SELECT count(value), sum(value), avg(value) FROM t
WHERE object_id IN (?,?,?) AND field_type IN (?,?,?)
AND attrib_id IN (?,?,?)

and then your client would do whatever additional computation it needed 
on the results it got back.

Would that fit with what you're trying to do?
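Since the number of keys in each IN-list varies at runtime, the client would typically build the placeholder list dynamically before binding values. A minimal sketch in plain Java (the table and column names are taken from the schema above; the helper name is illustrative):

```java
import java.util.Collections;

public class QueryBuilder {
    // Build a parameterized query with one '?' per key in each IN-list.
    static String buildQuery(int nObjects, int nFieldTypes, int nAttribs) {
        String obj = String.join(",", Collections.nCopies(nObjects, "?"));
        String ft  = String.join(",", Collections.nCopies(nFieldTypes, "?"));
        String at  = String.join(",", Collections.nCopies(nAttribs, "?"));
        return "SELECT count(value), sum(value), avg(value) FROM t"
             + " WHERE object_id IN (" + obj + ")"
             + " AND field_type IN (" + ft + ")"
             + " AND attrib_id IN (" + at + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(3, 1, 2));
    }
}
```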


On 04/25/2013 03:36 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:
> Michael: Fair enough. Let me see what relevant information I can add to what I've already shared:
> 1. To Lars' point, my 250K keys are unlikely to fall into fewer than 250K sub-ranges.
> 2. Here's a bit more about my schema:
>   2.1 My rowkeys are composed of two entities - let's call them object-id and field-type.
> An object (O1) has 100s of field-types (F1, F2, F3, ...). Each object-id/field-type pair
> has 100s of attributes (A1, A2, A3, ...).
>   2.2 My rowkeys are O1-F1, O1-F2, O1-F3, etc.
>   2.3 My primary application (not the one my original post was about) accesses rows by these rowkeys.
>   2.4 My application that does aggregation is given a bunch of objects <O1, O2, O3>,
> a field-type <F1>, a bunch of attributes <A1, A2> and some computation to perform.
>   2.5 As you can see, scans are unlikely to be useful when fetching O1-F1, O2-F1, O3-F1, etc., since those rowkeys are not contiguous.
> Viral: How do I tackle aggregation using observers? Let's say I override the postGet
> method: I do a multi-get from my client and my method gets called on each region server
> for each row. What is the next step with this approach?
> ----- Original Message -----
> From: user@hbase.apache.org
> To: larsh@apache.org, user@hbase.apache.org
> Cc: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
> At: Apr 25 2013 18:12:46
> I don't think Phoenix will solve his problem.
> He also needs to explain more about his problem before we can start to think about the solution.
> On Apr 25, 2013, at 4:54 PM, lars hofhansl <larsh@apache.org> wrote:
>> You might want to have a look at Phoenix (https://github.com/forcedotcom/phoenix),
which does that and more, and gives a SQL/JDBC interface.
>> -- Lars
>> ________________________________
>> From: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) <skadambi@bloomberg.net>
>> To: user@hbase.apache.org
>> Sent: Thursday, April 25, 2013 2:44 PM
>> Subject: Coprocessors
>> Folks:
>> This is my first post on the HBase user mailing list.
>> I have the following scenario:
>> I have an HBase table of up to a billion keys. I'm looking to support an application
>> where, on some user action, I'd need to fetch multiple columns for up to 250K keys and
>> do some sort of aggregation on them. Fetching all that data and doing the aggregation
>> in my application takes about a minute.
>> I'm looking to co-locate the aggregation logic with the region servers to
>> a. Distribute the aggregation
>> b. Avoid having to fetch large amounts of data over the network (this could
>> potentially be cross-datacenter)
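Distributing the aggregation means each region server returns a partial result (e.g. a count and a sum for its slice of keys) and the client merges them; avg in particular must be derived from the merged sum and count, not by averaging the per-server averages. A hedged sketch of that merge step in plain Java (the class and method names are illustrative, not an HBase API):

```java
public class PartialAggregates {
    // A partial aggregate as one region server might return it.
    static final class Partial {
        final long count;
        final long sum;
        Partial(long count, long sum) { this.count = count; this.sum = sum; }
    }

    // Merge partials from all servers; avg is computed only at the end.
    static double mergedAvg(Partial... parts) {
        long count = 0, sum = 0;
        for (Partial p : parts) {
            count += p.count;
            sum += p.sum;
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Two servers: (3 rows, sum 30) and (1 row, sum 10) -> overall avg 10.0
        System.out.println(mergedAvg(new Partial(3, 30), new Partial(1, 10)));
    }
}
```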
>> Neither observers nor aggregation endpoints work for this use case: observers don't
>> return data back to the client, while aggregation endpoints work in the context of
>> scans, not a multi-get. (Are these correct assumptions?)
>> I'm looking to write a service that runs alongside the region servers and acts as a
>> proxy between my application and the region servers.
>> I plan to use the logic in the HBase client's HConnectionManager to segment my request
>> of 1M rowkeys into sub-requests per region server. These are sent over to the proxy,
>> which fetches the data from the region server, aggregates locally, and sends the data
>> back. Does this sound reasonable or even a useful thing to pursue?
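The segmenting step described above can be sketched with a sorted map of region start keys, which is essentially the lookup the HBase client performs internally; in a real client the region boundaries and server names would come from the region location cache rather than being hard-coded as they are in this illustrative example:

```java
import java.util.*;

public class RegionPartitioner {
    // Group rowkeys into per-server buckets given region start keys.
    // floorEntry finds the last region whose start key is <= the rowkey.
    static Map<String, List<String>> partition(TreeMap<String, String> regionStarts,
                                               List<String> rowkeys) {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String key : rowkeys) {
            String server = regionStarts.floorEntry(key).getValue();
            buckets.computeIfAbsent(server, s -> new ArrayList<>()).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Three regions with illustrative start keys and server names.
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "rs1");   // the first region starts at the empty key
        regions.put("O2", "rs2");
        regions.put("O5", "rs3");
        System.out.println(partition(regions, Arrays.asList("O1-F1", "O3-F1", "O6-F1")));
    }
}
```

Each bucket then becomes one sub-request sent to the proxy co-located with that region server.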
>> Regards,
>> -sudarshan
