hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tim robertson <timrobertson...@gmail.com>
Subject Re: indexing question
Date Fri, 03 Jul 2009 07:37:20 GMT
Those 2 tables could be collapsed into 1 table with 2 columns of course...

On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<timrobertson100@gmail.com> wrote:
> Hi,
> Disclaimer: I am a newbie, so this is just one option, and I am basing
> on my understanding that secondary indexes are not yet working on
> HBase...
> So since HBase has very fast "get by primary key", but is *still* (?)
> without working secondary indexes, you would need to do scans to find
> the records.  A workaround would be to have 2 more tables
> "Country_Contains" and "Country_Contained_In", and in each table, the
> primary key is the unique ID of the country, the payload being the
> Keys to the rows in the main table.  Basically this is creating 2
> tables to act as the index manually.  This is a duplication of data,
> and would require management of 3 tables wrapped in a transaction when
> doing CRUD, but it would allow for lookup of the rows to modify
> without need for scanning.
> Just one idea...
> Cheers,
> Tim
> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ishaaq@gmail.com> wrote:
>> Hi all,
>> I am pretty new to HBase so forgive me if this seems like a silly question.
>> Each row in my Hbase table is a geographical location that is related to
>> other locations. For e.g. one relationship is the CONTAIN relationship. So,
>> Europe CONTAINs  England, France, Spain etc. There is an inverse
>> relationship as well called PARENT, so England has a PARENT called Europe.
>> However, note that, for various business reasons not pertinant to this
>> discussion, the inverse relationship need not always be set, i.e. we may not
>> store France with a PARENT value of Europe, even though Europe CONTAINs
>> France.
>> So, I store each location as a row with an id and the payload data for that
>> location as a separate data column. This data column includes the sets of
>> ids of the related locations.
>> Now, I want to be able to update/delete locations consistently. So, in my
>> example above, I might want to delete France, in which case I also want to
>> make sure that I delete the CONTAINs relationship that Europe has with
>> France as that is now obsolete. What is the most efficient way to do this? I
>> want to minimise the number of writes I would have to do - on the other hand
>> optimising read performance is more important as writes do not happen that
>> often (this is geographic data after all).
>> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1 write
>> operation to delete France and n write operations to delete the
>> relationships that n other locations may have to France. In the case of a
>> root location like Europe that may have a large number of locations that
>> relate to it this may be expensive, but I see no other way.
>> So, I was wondering, how do I index this to speed this up as far as
>> possible. So, given the location Europe, what are the fields I should
>> include in its row and how to index them? I could create a column family for
>> each relationship type with a label - the label being the id of the location
>> this location is related to, so, for e.g., the Europe row would have a
>> column called CONTAIN:England (assuming "England" is the id for the England
>> column - in reality it would be a UUID). I would then have as many labels
>> under the CONTAIN family for Europe as locations that Europe contains.
>> How would I index this and ensure that when deleting France the query: "list
>> all locations that CONTAIN France" returns with Europe (and whatever else)
>> as quickly as possible?
>> Thanks,
>> Ishaaq

View raw message