hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: indexing question
Date Sun, 05 Jul 2009 20:58:10 GMT
On Sat, Jul 4, 2009 at 10:30 PM, Ishaaq Chandy <ishaaq@gmail.com> wrote:

>
> However, I am still curious about the earlier comments about secondary
> indices not being available - does this mean I should avoid using
> IndexedTable and IndexConfiguration completely? I am using 0.19.3.
>



The lads were probably talking about TRUNK.  Transactional and indexed hbase
were absent for a while as they were being rejiggered to fit the new hbase
internals.  They are still in a state of flux but should be put back
together by the time of the 0.20.0 release.

In 0.19.x, this facility is experimental but works.

St.Ack



>
> Thanks,
> Ishaaq
>
>
> Jonathan Gray-2 wrote:
> >
> > Ishaaq,
> >
> > I think Michael has the right idea.  You can do this in a single table
> > with two families per relationship.
> >
> > The row key is the location uuid.  And you could have 6 families:
> >
> > CONTAINS, CONTAINED_IN, PARENTS, PARENT_OF, SURROUNDS, SURROUNDED_BY
> >
> > You would duplicate all your data, but random writes are quite
> > fast/cheap in HBase.  So updates to this big fat row would be very fast;
> > this is a big advantage of HBase.
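> >
> > A rough, unverified sketch of creating such a table with the 0.20 client
> > API (the table name and families are just the ones above):
> >
> > === (unverified sketch - table creation)
> >
> > // org.apache.hadoop.hbase.{HBaseConfiguration,HTableDescriptor,HColumnDescriptor}
> > // and org.apache.hadoop.hbase.client.HBaseAdmin
> > HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
> > HTableDescriptor desc = new HTableDescriptor("locations");
> > for (String family : new String[] {"CONTAINS", "CONTAINED_IN",
> >     "PARENTS", "PARENT_OF", "SURROUNDS", "SURROUNDED_BY"}) {
> >    // one family per relationship direction
> >    desc.addFamily(new HColumnDescriptor(family));
> > }
> > admin.createTable(desc);
> >
> > ==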
> >
> > So to delete France, you would grab all the families from its row, and
> > iterate down them, deleting the other side of the relationship.
> >
> > === (unverified pseudo code)
> >
> > // fetch France's row once; its CONTAINED_IN family lists every
> > // location that claims to contain France
> > HTable ht = new HTable("locations");
> > Result result = ht.get(new Get(france_id));
> >
> > NavigableMap<byte[],byte[]> containedIn =
> >    result.getFamilyMap(Bytes.toBytes("CONTAINED_IN"));
> >
> > // for each such location, remove its CONTAINS:<france> back-reference
> > for (byte[] containedIn_id : containedIn.keySet()) {
> >    Delete delete = new Delete(containedIn_id);
> >    delete.deleteColumns(Bytes.toBytes("CONTAINS"), france_id);
> >    ht.delete(delete);
> > }
> >
> > ==
> >
> > You'd have to do that for all 6 relationships, so it doubles your delete
> > work as well.
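> >
> > Roughly (again unverified, and assuming the same HTable instance and
> > families as above), that generalizes to something like:
> >
> > === (unverified pseudo code)
> >
> > // for each family in France's row, delete the mirror-image cell in the
> > // related row, then drop France's own row
> > String[][] pairs = {{"CONTAINED_IN", "CONTAINS"}, {"CONTAINS", "CONTAINED_IN"},
> >     {"PARENTS", "PARENT_OF"}, {"PARENT_OF", "PARENTS"},
> >     {"SURROUNDED_BY", "SURROUNDS"}, {"SURROUNDS", "SURROUNDED_BY"}};
> > Result result = ht.get(new Get(france_id));
> > for (String[] pair : pairs) {
> >    NavigableMap<byte[],byte[]> related =
> >       result.getFamilyMap(Bytes.toBytes(pair[0]));
> >    if (related == null) continue;
> >    for (byte[] other_id : related.keySet()) {
> >       Delete delete = new Delete(other_id);
> >       delete.deleteColumns(Bytes.toBytes(pair[1]), france_id);
> >       ht.delete(delete);
> >    }
> > }
> > ht.delete(new Delete(france_id));
> >
> > ==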
> >
> > How you design this really depends on what your requirements are.  Many
> > times I'll start with a design like this which basically allows any kind
> > of operation relatively quickly at the expense of doubling the data.
> >
> > Most other designs are going to require some type of scanning.
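> >
> > For example, without the duplicated CONTAINED_IN family, answering "which
> > locations CONTAIN France" would mean scanning the whole table for rows
> > that carry a CONTAINS:<france> cell -- an unverified sketch:
> >
> > === (unverified pseudo code)
> >
> > Scan scan = new Scan();
> > scan.addColumn(Bytes.toBytes("CONTAINS"), france_id);
> > ResultScanner scanner = ht.getScanner(scan);
> > try {
> >    // only rows that actually have CONTAINS:<france> come back, but the
> >    // regionservers still walk the whole table to find them
> >    for (Result r : scanner) {
> >       byte[] containingLocation = r.getRow();
> >    }
> > } finally {
> >    scanner.close();
> > }
> >
> > ==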
> >
> > JG
> >
> >
> > Michael Robellard wrote:
> >> Assuming you always know the location you want to start from:
> >>
> >> Can't you have a table with a column family called contains: which holds
> >> all the places that place contains,
> >>
> >> another column family for the contained-in relationship,
> >>
> >> and a third column family for surrounding?
> >>
> >> The keys for each column value would be the row key for the location, and
> >> then if you had information that you used all the time you could store
> >> it in the value for the column so you don't have to do a second table
> >> lookup every time.
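> >>
> >> e.g. (unverified sketch; "contains" is the family described above, and
> >> storing the child's display name as the cell value means a read of the
> >> parent row alone is often enough; locationsTable is an assumed HTable):
> >>
> >> Put put = new Put(europe_id);
> >> put.add(Bytes.toBytes("contains"), france_id, Bytes.toBytes("France"));
> >> locationsTable.put(put);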
> >>
> >> Ishaaq Chandy wrote:
> >>> No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am well
> >>> aware of the reasons why to use HBase over a traditional RDBMS - so am
> >>> not complaining about this.
> >>>
> >>> No, I was asking the question because I was not sure what the best
> >>> approach
> >>> would be.
> >>>
> >>>
> >>> By the way, I did not convey the whole story - there is actually a third
> >>> type of relationship as well - SURROUNDING - i.e. adjacent geographical
> >>> locations SURROUND each other (again, for business reasons, this
> >>> relationship is not necessarily always symmetric - though it usually is).
> >>>
> >>> So, when you say HBase doesn't provide declarative secondary indices you
> >>> lost me - what are these? How are these different from the ones
> >>> available via IndexedTable and IndexSpecification?
> >>>
> >>> Hmm, I was hoping by using sparse values in a column family labelled
> >>> by the
> >>> location ids I would just have to search for rows which had a non-empty
> >>> value for the CONTAIN:France column to retrieve the values for that
> >>> example
> >>> query I mentioned. I understand that that would make the CONTAIN column
> >>> family (and the PARENT and SURROUNDING families too) quite wide but I
> >>> remember reading somewhere that that was quite acceptable for HBase.
> >>>
> >>> Further, I was hoping, since the column labels themselves contain the
> >>> data I am searching for, that there would be an efficient way to do this
> >>> (don't know why or how - I was just hoping).
> >>>
> >>> Anyway, if it means that the only way to do this efficiently in HBase is
> >>> using four tables - one for the locations and one for each of the three
> >>> types of relationships - then so be it - that is what I'll have to do -
> >>> I was just hoping for a simpler alternative with my idea to use column
> >>> families labelled by the location ids.
> >>>
> >>> Ishaaq
> >>>
> >>>
> >>> Ryan Rawson wrote:
> >>>
> >>>> Hey,
> >>>>
> >>>> HBase doesn't provide declarative secondary indexes.  Your app code
> >>>> needs to maintain them, writing into 2 tables with dual writes.  You
> >>>> don't have to duplicate data - you can just use the secondary index as
> >>>> a pointer into the main table - but that can cause you to chase down
> >>>> potentially thousands of extra RPCs.  There are no hbase transactions
> >>>> when you are modifying multiple tables, but that isn't as big of a
> >>>> problem as it seems.
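> >>>>
> >>>> A dual write might look roughly like this (unverified sketch; the
> >>>> "locations" / "locations_by_parent" table names and the "info" /
> >>>> "child" families are just made up for illustration):
> >>>>
> >>>> HTable data = new HTable("locations");
> >>>> HTable index = new HTable("locations_by_parent");
> >>>>
> >>>> // 1. write the main row
> >>>> Put dataPut = new Put(france_id);
> >>>> dataPut.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
> >>>>     Bytes.toBytes("France"));
> >>>> data.put(dataPut);
> >>>>
> >>>> // 2. write the index row: its key is the indexed value (the parent),
> >>>> // and the cell just points back at the main row's key
> >>>> Put indexPut = new Put(europe_id);
> >>>> indexPut.add(Bytes.toBytes("child"), france_id, france_id);
> >>>> index.put(indexPut);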
> >>>>
> >>>> If all this sounds very 'raw' and 'painful' and 'error prone', let me
> >>>> remind you what HBase is for, and perhaps you can make a better
> >>>> choice.
> >>>>
> >>>> HBase is when you hit the limits of what you can do with mysql.  When
> >>>> you work to scale mysql you end up removing the following features:
> >>>> - no transactions
> >>>> - no secondary indexes (slow on mysql/innodb)
> >>>> - separate multiple table indexes on different databases
> >>>> - sharding (last step)
> >>>>
> >>>> Once you hit the magical 300-500GB size and you have hit the end of
> >>>> where master-slave replication scaling can take you, you need to move
> >>>> on to different techniques and technology.  This is where HBase picks
> >>>> up.
> >>>>
> >>>> So all the things you list below as 'negatives' are the reality on the
> >>>> ground when you scale no matter what technology you use.  If they
> >>>> sound too ugly for you, perhaps you really need mysql?
> >>>>
> >>>>
> >>>> On Fri, Jul 3, 2009 at 12:37 AM, tim
> >>>> robertson<timrobertson100@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Those 2 tables could be collapsed into 1 table with 2 columns of
> >>>>> course...
> >>>>>
> >>>>> On Fri, Jul 3, 2009 at 9:24 AM, tim
> >>>>> robertson<timrobertson100@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Disclaimer: I am a newbie, so this is just one option, and I am basing
> >>>>>> it on my understanding that secondary indexes are not yet working on
> >>>>>> HBase...
> >>>>>>
> >>>>>> So since HBase has very fast "get by primary key", but is *still* (?)
> >>>>>> without working secondary indexes, you would need to do scans to find
> >>>>>> the records.  A workaround would be to have 2 more tables,
> >>>>>> "Country_Contains" and "Country_Contained_In", and in each table the
> >>>>>> primary key is the unique ID of the country, the payload being the
> >>>>>> keys to the rows in the main table.  Basically this is creating 2
> >>>>>> tables to act as the index manually.  This is a duplication of data,
> >>>>>> and would require management of 3 tables wrapped in a transaction when
> >>>>>> doing CRUD, but it would allow for lookup of the rows to modify
> >>>>>> without need for scanning.
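> >>>>>>
> >>>>>> An unverified sketch of that lookup (the "ref" family and the table
> >>>>>> names here are just illustrative):
> >>>>>>
> >>>>>> HTable containedIn = new HTable("Country_Contained_In");
> >>>>>> HTable locations = new HTable("locations");
> >>>>>>
> >>>>>> // one get against the index row for France...
> >>>>>> Result idx = containedIn.get(new Get(france_id));
> >>>>>> NavigableMap<byte[],byte[]> refs = idx.getFamilyMap(Bytes.toBytes("ref"));
> >>>>>> if (refs != null) {
> >>>>>>    // ...then straight point-gets into the main table, no scan needed
> >>>>>>    for (byte[] locationKey : refs.keySet()) {
> >>>>>>       Result location = locations.get(new Get(locationKey));
> >>>>>>    }
> >>>>>> }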
> >>>>>>
> >>>>>> Just one idea...
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ishaaq@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi all,
> >>>>>>> I am pretty new to HBase so forgive me if this seems like a silly
> >>>>>>> question.
> >>>>>>>
> >>>>>>> Each row in my HBase table is a geographical location that is
> >>>>>>> related to other locations. For e.g. one relationship is the CONTAIN
> >>>>>>> relationship. So, Europe CONTAINs England, France, Spain etc. There
> >>>>>>> is an inverse relationship as well called PARENT, so England has a
> >>>>>>> PARENT called Europe. However, note that, for various business
> >>>>>>> reasons not pertinent to this discussion, the inverse relationship
> >>>>>>> need not always be set, i.e. we may not store France with a PARENT
> >>>>>>> value of Europe, even though Europe CONTAINs France.
> >>>>>>>
> >>>>>>> So, I store each location as a row with an id and the payload data
> >>>>>>> for that location as a separate data column. This data column
> >>>>>>> includes the sets of ids of the related locations.
> >>>>>>>
> >>>>>>> Now, I want to be able to update/delete locations consistently. So,
> >>>>>>> in my example above, I might want to delete France, in which case I
> >>>>>>> also want to make sure that I delete the CONTAINs relationship that
> >>>>>>> Europe has with France as that is now obsolete. What is the most
> >>>>>>> efficient way to do this? I want to minimise the number of writes I
> >>>>>>> would have to do - on the other hand optimising read performance is
> >>>>>>> more important as writes do not happen that often (this is geographic
> >>>>>>> data after all).
> >>>>>>>
> >>>>>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e.
> >>>>>>> 1 write operation to delete France and n write operations to delete
> >>>>>>> the relationships that n other locations may have to France. In the
> >>>>>>> case of a root location like Europe that may have a large number of
> >>>>>>> locations that relate to it this may be expensive, but I see no
> >>>>>>> other way.
> >>>>>>>
> >>>>>>> So, I was wondering, how do I index this to speed this up as far as
> >>>>>>> possible. So, given the location Europe, what are the fields I
> >>>>>>> should include in its row and how to index them? I could create a
> >>>>>>> column family for each relationship type with a label - the label
> >>>>>>> being the id of the location this location is related to, so, for
> >>>>>>> e.g., the Europe row would have a column called CONTAIN:England
> >>>>>>> (assuming "England" is the id for the England column - in reality
> >>>>>>> it would be a UUID). I would then have as many labels under the
> >>>>>>> CONTAIN family for Europe as locations that Europe contains.
> >>>>>>>
> >>>>>>> How would I index this and ensure that when deleting France the
> >>>>>>> query: "list all locations that CONTAIN France" returns with Europe
> >>>>>>> (and whatever else) as quickly as possible?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ishaaq
> >>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/indexing-question-tp24318679p24340332.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>
