hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ishaaq Chandy <ish...@gmail.com>
Subject Re: indexing question
Date Fri, 03 Jul 2009 10:15:11 GMT

No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am well
aware of the reasons why to use HBase over a traditional RDBMS - so am not
complaining about this.

No, I was asking the question because I was not sure what the best approach
would be.

By the way, I did not convey the whole story - there is actually a third
type of relationship as well - SURROUNDING - i.e. adjacent geographical
locations SURROUND each other (again, for business reasons, this
relationship is not necessarily always reflexive - though it usually is).

So, when you say HBase doesn't provide declarative secondary indices you
lost me - what are these? How are these different from the ones available
via IndexedTable and IndexSpecification?

Hmm, I was hoping by using sparse values in a column family labelled by the
location ids I would just have to search for rows which had a non-empty
value for the CONTAIN:France column to retrieve the values for that example
query I mentioned. I understand that that would make the CONTAIN column
family (and the PARENT and SURROUNDING families too) quite wide but I
remember reading somewhere that that was quite acceptable for HBase.

Further, I was hoping, since the columns labels themselves contain the data
I am searching for, that there would an efficient way to do this (don't know
why or how - I was just hoping).

Anyway, if it means that the only way to do this efficiently in HBase is
using four tables - one for the locations and one for each of the three
types of relationships then so be it - that is what I'll have to do - I was
just hoping for a simpler alternative with my idea to use column families
labelled by the location ids.


Ryan Rawson wrote:
> Hey,
> HBase doesn't provide declarative secondary indexes.  Your app code
> needs to maintain them, writing into 2 tables with dual writes.  You
> don't have to duplicate data, you can just use the secondary index as
> a pointer into the main table, causing you to have to chase down
> potentially thousands of extra RPCs. There are no hbase transactions
> when you are modifying multiple tables, but that isnt as big of a
> problem as it seems.
> If all this sounds very 'raw' and 'painful' and 'error prone', let me
> remind you what HBase is for, and perhaps you can make a better
> choice.
> HBase is when you hit the limits of what you can do with mysql.  When
> you work to scale mysql you end up removing the following features:
> - no transactions
> - no secondary indexes (slow on mysql/innodb)
> - separate multiple table indexes on different databases
> - sharding (last step)
> Once you hit the magical 300-500GB size and you have hit the end of
> where master-slave replication scaling can take you, you need to move
> on to different techniques and technology.  This is where HBase picks
> up.
> So all the things you list below as 'negatives' are the reality on the
> ground when you scale no matter what technology you use.  If they
> sound too ugly for you, perhaps you really need mysql?
> On Fri, Jul 3, 2009 at 12:37 AM, tim robertson<timrobertson100@gmail.com>
> wrote:
>> Those 2 tables could be collapsed into 1 table with 2 columns of
>> course...
>> On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<timrobertson100@gmail.com>
>> wrote:
>>> Hi,
>>> Disclaimer: I am a newbie, so this is just one option, and I am basing
>>> on my understanding that secondary indexes are not yet working on
>>> HBase...
>>> So since HBase has very fast "get by primary key", but is *still* (?)
>>> without working secondary indexes, you would need to do scans to find
>>> the records.  A workaround would be to have 2 more tables
>>> "Country_Contains" and "Country_Contained_In", and in each table, the
>>> primary key is the unique ID of the country, the payload being the
>>> Keys to the rows in the main table.  Basically this is creating 2
>>> tables to act as the index manually.  This is a duplication of data,
>>> and would require management of 3 tables wrapped in a transaction when
>>> doing CRUD, but it would allow for lookup of the rows to modify
>>> without need for scanning.
>>> Just one idea...
>>> Cheers,
>>> Tim
>>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ishaaq@gmail.com> wrote:
>>>> Hi all,
>>>> I am pretty new to HBase so forgive me if this seems like a silly
>>>> question.
>>>> Each row in my Hbase table is a geographical location that is related
>>>> to
>>>> other locations. For e.g. one relationship is the CONTAIN relationship.
>>>> So,
>>>> Europe CONTAINs  England, France, Spain etc. There is an inverse
>>>> relationship as well called PARENT, so England has a PARENT called
>>>> Europe.
>>>> However, note that, for various business reasons not pertinant to this
>>>> discussion, the inverse relationship need not always be set, i.e. we
>>>> may not
>>>> store France with a PARENT value of Europe, even though Europe CONTAINs
>>>> France.
>>>> So, I store each location as a row with an id and the payload data for
>>>> that
>>>> location as a separate data column. This data column includes the sets
>>>> of
>>>> ids of the related locations.
>>>> Now, I want to be able to update/delete locations consistently. So, in
>>>> my
>>>> example above, I might want to delete France, in which case I also want
>>>> to
>>>> make sure that I delete the CONTAINs relationship that Europe has with
>>>> France as that is now obsolete. What is the most efficient way to do
>>>> this? I
>>>> want to minimise the number of writes I would have to do - on the other
>>>> hand
>>>> optimising read performance is more important as writes do not happen
>>>> that
>>>> often (this is geographic data after all).
>>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1
>>>> write
>>>> operation to delete France and n write operations to delete the
>>>> relationships that n other locations may have to France. In the case of
>>>> a
>>>> root location like Europe that may have a large number of locations
>>>> that
>>>> relate to it this may be expensive, but I see no other way.
>>>> So, I was wondering, how do I index this to speed this up as far as
>>>> possible. So, given the location Europe, what are the fields I should
>>>> include in its row and how to index them? I could create a column
>>>> family for
>>>> each relationship type with a label - the label being the id of the
>>>> location
>>>> this location is related to, so, for e.g., the Europe row would have a
>>>> column called CONTAIN:England (assuming "England" is the id for the
>>>> England
>>>> column - in reality it would be a UUID). I would then have as many
>>>> labels
>>>> under the CONTAIN family for Europe as locations that Europe contains.
>>>> How would I index this and ensure that when deleting France the query:
>>>> "list
>>>> all locations that CONTAIN France" returns with Europe (and whatever
>>>> else)
>>>> as quickly as possible?
>>>> Thanks,
>>>> Ishaaq

View this message in context: http://www.nabble.com/indexing-question-tp24318679p24320773.html
Sent from the HBase User mailing list archive at Nabble.com.

View raw message