hbase-user mailing list archives

From Ishaaq Chandy <ish...@gmail.com>
Subject Re: indexing question
Date Sun, 05 Jul 2009 05:30:46 GMT

Thanks for the responses guys, sounds like multiple relationship columns are
the way to go.

However, I am still curious about the earlier comments about secondary
indices not being available - does this mean I should avoid using
IndexedTable and IndexConfiguration completely? I am using 0.19.3.


Jonathan Gray-2 wrote:
> Ishaaq,
> I think Michael has the right idea.  You can do this in a single table 
> with two families per relationship.
> The row key is the location uuid.  And you could have 6 families:
> You would duplicate all your data, but random writes are quite 
> fast/cheap in HBase.  So updates to this big fat row would be very fast, 
> this is a big advantage with HBase.
> So to delete France, you would grab all the families from its row, and 
> iterate down them, deleting the other side of the relationship.
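> === (unverified pseudo code)
> // france_id is the byte[] row key of the location being deleted
> HTable ht = new HTable("locations");
> Result result = ht.get(new Get(france_id));
> // qualifiers under CONTAINED_IN are the row keys of the containing locations
> NavigableMap<byte[],byte[]> containedIn =
>    result.getFamilyMap(Bytes.toBytes("CONTAINED_IN"));
> for(byte [] containedIn_id : containedIn.keySet()) {
>    // remove the inverse CONTAINS column for France from the other row
>    Delete delete = new Delete(containedIn_id);
>    delete.deleteColumns(Bytes.toBytes("CONTAINS"), france_id);
>    ht.delete(delete);
> }
> ==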
> You'd have to do that for all 6 relationships, so it doubles your delete 
> work as well.
> How you design this really depends on what your requirements are.  Many 
> times I'll start with a design like this which basically allows any kind 
> of operation relatively quickly at the expense of doubling the data.
> Most other designs are going to require some type of scanning.
> JG
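The paired-family layout and the delete walk-through above can be modelled without a cluster. What follows is a minimal in-memory sketch, not HBase client code: nested maps stand in for row -> family -> qualifier. The class name `LocationGraph`, the helpers `relate` and `deleteLocation`, and the family names `SURROUNDS`/`SURROUNDED_BY` are assumptions made for illustration; only CONTAINS and CONTAINED_IN appear in the thread.

```java
import java.util.*;

// In-memory stand-in for a single "locations" table: row key -> family -> qualifiers.
// Hypothetical model for illustration only; no HBase API is used here.
class LocationGraph {
    // Each relationship family is paired with the family that holds its inverse.
    static final Map<String, String> INVERSE = new HashMap<>();
    static {
        pair("CONTAINS", "CONTAINED_IN");    // names used in the thread
        pair("SURROUNDS", "SURROUNDED_BY");  // assumed names
    }
    static void pair(String a, String b) { INVERSE.put(a, b); INVERSE.put(b, a); }

    final Map<String, Map<String, Set<String>>> rows = new TreeMap<>();

    // Returns (creating if needed) the set of qualifiers in one family of one row.
    Set<String> family(String row, String family) {
        return rows.computeIfAbsent(row, r -> new TreeMap<>())
                   .computeIfAbsent(family, f -> new TreeSet<>());
    }

    // Dual write: record the relationship on both rows.
    void relate(String from, String family, String to) {
        family(from, family).add(to);
        family(to, INVERSE.get(family)).add(from);
    }

    // Deletes a row and the other side of every relationship it holds.
    // Returns the number of writes: n inverse-column deletes plus 1 row delete.
    int deleteLocation(String id) {
        int writes = 0;
        Map<String, Set<String>> families = rows.getOrDefault(id, Collections.emptyMap());
        for (Map.Entry<String, Set<String>> e : families.entrySet()) {
            String inverse = INVERSE.get(e.getKey());
            for (String other : e.getValue()) {
                Map<String, Set<String>> otherRow = rows.get(other);
                Set<String> back = (otherRow == null) ? null : otherRow.get(inverse);
                if (back != null && back.remove(id)) writes++;
            }
        }
        rows.remove(id);
        return writes + 1;
    }
}
```

In this model "list all locations that CONTAIN France" is a single read of France's CONTAINED_IN family, and deleting France costs one write per related row plus one for the row itself, which matches the 1+n estimate in the original question.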
> Michael Robellard wrote:
>> Assuming you always know the location you want to start from:
>> Can't you have a table with a column family called contains: which holds
>> all the places that place contains?
>> Another column family for the contained in relationship
>> and a third column family for surrounding.
>> The keys for each column value would be the row key for the location, and
>> then if you had information that you used all the time you could store
>> it in the value for the column so you don't have to do a second table
>> lookup all the time.
>> Ishaaq Chandy wrote:
>>> No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am well
>>> aware of the reasons to use HBase over a traditional RDBMS - so am not
>>> complaining about this.
>>> No, I was asking the question because I was not sure what the best
>>> approach would be.
>>> By the way, I did not convey the whole story - there is actually a third
>>> type of relationship as well - SURROUNDING - i.e. adjacent geographical
>>> locations SURROUND each other (again, for business reasons, this
>>> relationship is not necessarily always symmetric - though it usually
>>> is).
>>> So, when you say HBase doesn't provide declarative secondary indices you
>>> lost me - what are these? How are these different from the ones available
>>> via IndexedTable and IndexSpecification?
>>> Hmm, I was hoping by using sparse values in a column family labelled by
>>> the location ids I would just have to search for rows which had a
>>> non-empty value for the CONTAIN:France column to retrieve the values for
>>> that example query I mentioned. I understand that that would make the
>>> CONTAIN column family (and the PARENT and SURROUNDING families too) quite
>>> wide but I remember reading somewhere that that was quite acceptable for
>>> HBase. Further, I was hoping, since the column labels themselves contain
>>> the data I am searching for, that there would be an efficient way to do
>>> this (don't know why or how - I was just hoping).
>>> Anyway, if it means that the only way to do this efficiently in HBase is
>>> using four tables - one for the locations and one for each of the three
>>> types of relationships then so be it - that is what I'll have to do - I
>>> was just hoping for a simpler alternative with my idea to use column
>>> families labelled by the location ids.
>>> Ishaaq
>>> Ryan Rawson wrote:
>>>> Hey,
>>>> HBase doesn't provide declarative secondary indexes.  Your app code
>>>> needs to maintain them, writing into 2 tables with dual writes.  You
>>>> don't have to duplicate data, you can just use the secondary index as
>>>> a pointer into the main table, causing you to have to chase down
>>>> potentially thousands of extra RPCs. There are no hbase transactions
>>>> when you are modifying multiple tables, but that isn't as big of a
>>>> problem as it seems.
>>>> If all this sounds very 'raw' and 'painful' and 'error prone', let me
>>>> remind you what HBase is for, and perhaps you can make a better
>>>> choice.
>>>> HBase is when you hit the limits of what you can do with mysql.  When
>>>> you work to scale mysql you end up removing the following features:
>>>> - no transactions
>>>> - no secondary indexes (slow on mysql/innodb)
>>>> - separate multiple table indexes on different databases
>>>> - sharding (last step)
>>>> Once you hit the magical 300-500GB size and you have hit the end of
>>>> where master-slave replication scaling can take you, you need to move
>>>> on to different techniques and technology.  This is where HBase picks
>>>> up.
>>>> So all the things you list below as 'negatives' are the reality on the
>>>> ground when you scale no matter what technology you use.  If they
>>>> sound too ugly for you, perhaps you really need mysql?
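The application-maintained index described above, with dual writes and the index used as a pointer into the main table, can be sketched in memory as follows. This is a hedged illustration rather than HBase code: two maps stand in for the two tables, and the class name `ManualIndex`, its fields, and the `reads` counter (one increment per simulated get) are assumptions introduced here.

```java
import java.util.*;

// Hypothetical in-memory model of an application-maintained secondary index.
// Two maps stand in for two HBase tables; nothing here is HBase API.
class ManualIndex {
    // "Main table": location id -> ids of the locations it contains.
    final Map<String, Set<String>> mainTable = new HashMap<>();
    // "Index table": location id -> ids of the rows that contain it (pointers only).
    final Map<String, Set<String>> indexTable = new HashMap<>();
    // Counts simulated gets, one per would-be RPC.
    int reads = 0;

    // Dual write: the application updates both tables itself.
    // No transaction spans the two writes.
    void addContains(String container, String contained) {
        mainTable.computeIfAbsent(container, k -> new TreeSet<>()).add(contained);
        indexTable.computeIfAbsent(contained, k -> new TreeSet<>()).add(container);
    }

    // Index lookup: one read of the index row, then one read per pointer
    // to fetch the full row from the main table.
    Map<String, Set<String>> rowsContaining(String id) {
        reads++;
        Set<String> pointers = indexTable.getOrDefault(id, Collections.emptySet());
        Map<String, Set<String>> result = new TreeMap<>();
        for (String rowKey : pointers) {
            reads++;
            result.put(rowKey, mainTable.get(rowKey));
        }
        return result;
    }
}
```

The read counter makes the trade-off visible: the index avoids duplicating row data, but a lookup that matches k rows costs 1 + k reads, which is the chain of extra RPCs mentioned above.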
>>>> On Fri, Jul 3, 2009 at 12:37 AM, tim robertson<timrobertson100@gmail.com> wrote:
>>>>> Those 2 tables could be collapsed into 1 table with 2 columns of
>>>>> course...
>>>>> On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<timrobertson100@gmail.com> wrote:
>>>>>> Hi,
>>>>>> Disclaimer: I am a newbie, so this is just one option, and I am basing
>>>>>> this on my understanding that secondary indexes are not yet working
>>>>>> on HBase...
>>>>>> So since HBase has very fast "get by primary key", but is *still*
>>>>>> without working secondary indexes, you would need to do scans to find
>>>>>> the records.  A workaround would be to have 2 more tables
>>>>>> "Country_Contains" and "Country_Contained_In", and in each table, the
>>>>>> primary key is the unique ID of the country, the payload being the
>>>>>> keys to the rows in the main table.  Basically this is creating 2
>>>>>> tables to act as the index manually.  This is a duplication of data,
>>>>>> and would require management of 3 tables wrapped in a transaction when
>>>>>> doing CRUD, but it would allow for lookup of the rows to modify
>>>>>> without need for scanning.
>>>>>> Just one idea...
>>>>>> Cheers,
>>>>>> Tim
>>>>>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ishaaq@gmail.com> wrote:
>>>>>>> Hi all,
>>>>>>> I am pretty new to HBase so forgive me if this seems like a silly
>>>>>>> question.
>>>>>>> Each row in my HBase table is a geographical location that is related
>>>>>>> to other locations. For example, one relationship is the CONTAIN
>>>>>>> relationship. So, Europe CONTAINs England, France, Spain etc. There is
>>>>>>> an inverse relationship as well called PARENT, so England has a PARENT
>>>>>>> called Europe. However, note that, for various business reasons not
>>>>>>> pertinent to this discussion, the inverse relationship need not always
>>>>>>> be set, i.e. we may not store France with a PARENT value of Europe,
>>>>>>> even though Europe CONTAINs France.
>>>>>>> So, I store each location as a row with an id and the payload for that
>>>>>>> location as a separate data column. This data column includes sets of
>>>>>>> ids of the related locations.
>>>>>>> Now, I want to be able to update/delete locations consistently. So, in
>>>>>>> my example above, I might want to delete France, in which case I want
>>>>>>> to make sure that I delete the CONTAINs relationship that Europe has
>>>>>>> with France as that is now obsolete. What is the most efficient way to
>>>>>>> do this? I want to minimise the number of writes I would have to do -
>>>>>>> on the other hand optimising read performance is more important as
>>>>>>> writes do not happen that often (this is geographic data after all).
>>>>>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1
>>>>>>> write operation to delete France and n write operations to delete the
>>>>>>> relationships that n other locations may have to France. In the case
>>>>>>> of a root location like Europe that may have a large number of
>>>>>>> locations that relate to it this may be expensive, but I see no other
>>>>>>> way.
>>>>>>> So, I was wondering, how do I index this to speed this up as far as
>>>>>>> possible? So, given the location Europe, what fields should I include
>>>>>>> in its row and how should I index them? I could create a column family
>>>>>>> for each relationship type with a label - the label being the id of
>>>>>>> the location this location is related to, so, for example, the Europe
>>>>>>> row would have a column called CONTAIN:England (assuming "England" is
>>>>>>> the id for the England column - in reality it would be a UUID). I
>>>>>>> would then have as labels under the CONTAIN family for Europe the
>>>>>>> locations that Europe contains.
>>>>>>> How would I index this and ensure that when deleting France the query:
>>>>>>> "list all locations that CONTAIN France" returns with Europe (and
>>>>>>> whatever else) as quickly as possible?
>>>>>>> Thanks,
>>>>>>> Ishaaq

View this message in context: http://www.nabble.com/indexing-question-tp24318679p24340332.html
Sent from the HBase User mailing list archive at Nabble.com.
