hbase-user mailing list archives

From Chris Hostetter <hossman_hb...@fucit.org>
Subject Some REST GET questions
Date Mon, 23 Mar 2009 21:22:12 GMT

I've got myself a little HBase install up and running on a small Hadoop 
cluster, currently running...
 	HBase Version	0.19.0, r735381
 	HBase Compiled	Sun Jan 18 14:29:34 PST 2009, stack
 	Hadoop Version	0.19.0, r713890
 	Hadoop Compiled	Fri Nov 14 03:12:29 UTC 2008, ndaley

testing stuff out with the hbase shell, things are working nicely.  I'm 
also trying out the REST API, and I have a few questions about
how to execute certain queries.

First off, this is the table i'm testing with...

{NAME => 'userdata', IS_ROOT => 'false', IS_META => 'false',
  FAMILIES => [{NAME => 'hist', BLOOMFILTER => 'false', COMPRESSION => 
'NONE', VERSIONS => '20', LENGTH => '2147483647', TTL => '-1', IN_MEMORY => 
'false', BLOCKCACHE => 'false'}, {NAME => 'user', BLOOMFILTER => 'false', 
COMPRESSION => 'NONE', VERSIONS => '1', LENGTH => '2147483647', TTL => 
'-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'}], INDEXES => []}

This hypothetical example being a user activity tracking system -- the 
"keys" will be usernames, and for every action a user takes, a row will be 
inserted into the userdata table.  for some of the data i only care about 
the last action the user took, and i put that in the "user" column family 
(only 1 version) and for other pieces of data i want to keep a history of 
the last 20 actions the user took (the "hist" column family)
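
To make the intended retention behaviour concrete, here is a toy in-memory 
model of the two families.  It is only a sketch of the semantics described 
above, not HBase code: the class name ToyRow, the column name 
"user:last_action", and the assumption that puts arrive in increasing 
timestamp order are all made up for illustration.

```python
from collections import defaultdict, deque

# Toy model of per-family version limits (NOT HBase code).
# The limits mirror the table description: 'hist' keeps 20, 'user' keeps 1.
MAX_VERSIONS = {"hist": 20, "user": 1}


class ToyRow:
    """One row key; each column holds a newest-first list of versions."""

    def __init__(self):
        # column name -> deque of (timestamp, value), newest first
        self.cells = defaultdict(deque)

    def put(self, column, timestamp, value):
        # Assumption: puts arrive in increasing timestamp order.
        family = column.split(":", 1)[0]
        versions = self.cells[column]
        versions.appendleft((timestamp, value))
        # Drop the oldest entries beyond the family's limit.
        while len(versions) > MAX_VERSIONS[family]:
            versions.pop()

    def get(self, column, versions=1):
        # Return up to `versions` of the newest entries for one column.
        return list(self.cells[column])[:versions]


row = ToyRow()
for ts in range(1, 26):
    row.put("hist:vote", ts, str(ts))
    row.put("user:last_action", ts, "action-%d" % ts)  # hypothetical column
```

Asking this model for 10 versions of "hist:vote" yields the 10 newest 
entries, while asking for 10 versions of a "user" column yields only one, 
which is the "whatever is available, up to 10" behaviour I am after.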

My first question is about clarifying what should/shouldn't be base64 
encoded.  According to the wiki docs for the REST interface...
...the "value" portion of a column entry is base64 
encoded, but the "name" is not -- this matches the behavior i observe when 
POSTing data and then inspecting it using the hbase shell -- however when 
I GET results from a query using the REST interface, the names are coming 
back base64 encoded as well.  This message from a year ago seems to 
suggest that this is the expected behavior because names "can be arbitrary 
binary strings." ...

...but in that case there is an API discrepancy between the I and the O in 
the I/O of the REST interface.  Which is considered more correct? Is 
there a migration plan for rectifying the discrepancy?

Second Question: querying for multiple versions.  I'm trying to figure out 
how i can execute the following query (from the hbase shell) via the REST 
interface...
    get 'userdata', 'hossman', {COLUMN => 'hist:vote', VERSIONS => 10}
...my naive assumption based on the other examples on the wiki is that 
something like this might work...
...but the "versions" request param seems to be ignored.  Is this type of 
multi-version query at all supported in the REST interface?

My last question also relates to querying for multiple versions of columns 
-- the key question being "column(s)" plural.  as i mentioned before, this 
query in the hbase shell works fine for getting the last 10 versions of a 
specific column...
     get 'userdata', 'hossman', {COLUMN => 'hist:vote', VERSIONS => 10}
...but i can't seem to find any way to indicate that i want the last 
10 versions of *all* the columns associated with the specified key 
-- in either the REST interface or the hbase shell. I was particularly 
surprised by this error...

    get 'userdata', 'hossman', { VERSIONS => 10 }
TypeError: can't convert Hash into String
 	from /var/opt/chrish-hadoop/hbase-0.19.0/bin/../bin/hirb.rb:326:in `get'
 	from /var/opt/chrish-hadoop/hbase-0.19.0/bin/../bin/hirb.rb:326:in `get'
 	from (hbase):47:in `binding'
Maybe IRB bug!!

...and the fact that this query only produced the most recent values for 
the specified columns (even though querying for either of them 
individually with the VERSIONS=>10 option produced the full list for each)...
    get 'userdata','hossman',{COLUMNS=>['hist:vote','hist:doc'],VERSIONS=>10}
COLUMN                       CELL
  hist:doc                    timestamp=1237842101205, value=2908
  hist:vote                   timestamp=1237842101205, value=23
2 row(s) in 0.0360 seconds

Obviously anything in the "user" family only has one version (because 
that's the way the family was declared) but that's ok -- my goal is to get 
whatever data is available going back up to 10 versions.  It's not so bad 
if i have to execute two REST GETs: one for all of the current values in 
the 'user' family, and one for the last 10 versions of all the values in 
the 'hist' family; and it's not the end of the world if i have to 
explicitly list all of the column names i want in each request -- but 
making a separate request for every column name that has multiple versions 
seems like it could get prohibitive.
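
For reference, the fallback of one request per column plus a client-side 
merge could look roughly like the sketch below.  It only covers the merge 
step: merge_versions is a hypothetical helper, the per-column lists are 
assumed to have been fetched already, and the second "hist:vote" entry is 
made-up sample data (the other two cells are the ones shown above).

```python
def merge_versions(per_column, limit=10):
    """Combine per-column version lists into one newest-first listing.

    per_column maps a column name to a list of (timestamp, value) pairs,
    one list per single-column query.  At most `limit` versions are kept
    for each column.
    """
    merged = []
    for column, versions in per_column.items():
        newest = sorted(versions, key=lambda tv: tv[0], reverse=True)[:limit]
        merged.extend((ts, column, value) for ts, value in newest)
    # Newest first; ties on timestamp are ordered by column name.
    merged.sort(key=lambda cell: (-cell[0], cell[1]))
    return merged


results = {
    "hist:vote": [(1237842101205, "23"), (1237842000000, "17")],  # 2nd is made up
    "hist:doc": [(1237842101205, "2908")],
}

for timestamp, column, value in merge_versions(results):
    print(column, "timestamp=%d," % timestamp, "value=%s" % value)
```

This works, but the number of round trips grows with the number of 
columns, which is exactly the cost i'd like to avoid.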

Thanks in advance for any light people might be able to shed on these 
questions.

