hbase-user mailing list archives

From Wilm Schumacher <wilm.schumac...@cawoom.com>
Subject Re: HBase entity relationship
Date Wed, 26 Nov 2014 10:39:50 GMT


On 26.11.2014 at 08:05, jatinpreet wrote:
> I am curious to know if a hybrid of approaches 2 and 3 could be used. This
> means having the rowkeys of jobs inside source row like in approach 2. And
> having the parent source rowkey as a column in job row as in approach 3. 
If you just use "option 2" and add a "source" qualifier to the metadata
of the job, as you mentioned, you would have such an approach.

However, I don't see an advantage in this idea, and there are
disadvantages. With a hybrid approach you decouple the tables, which is a
bad plan.

Assume you delete a source, but don't delete its jobs (e.g. because the
operation was interrupted). If you use "source" as the leading concept,
the jobs cannot be accessed anymore (as the source is missing), but they
are still in the database. In this case you have "dead data" in the
database. But if a job relies in some way on the "source" (hybrid
approach), you have a "wild pointer": the job references a source row
that no longer exists. The first is not good, the second is really bad.

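To make the difference concrete, here is a minimal sketch in Python. It
models the two tables as plain dictionaries and does not use the HBase
API; the layout and all names (make_tables, interrupted_delete,
reachable_jobs, dangling_back_references) are assumptions made only for
illustration.

```python
# In-memory model of the "source" and "job" tables (plain dicts, not HBase).

def make_tables(hybrid):
    """Build one source with two jobs; in hybrid mode jobs point back."""
    sources = {"src1": {"jobs": ["job1", "job2"]}}
    jobs = {"job1": {"title": "a"}, "job2": {"title": "b"}}
    if hybrid:
        # Hybrid approach: every job also stores its parent source rowkey.
        for row in jobs.values():
            row["source"] = "src1"
    return sources, jobs


def interrupted_delete(sources, source_key):
    """Remove the source row but never get around to deleting its jobs."""
    del sources[source_key]


def reachable_jobs(sources, jobs):
    """Job rowkeys that can still be found by starting from a source row."""
    return {key for row in sources.values() for key in row["jobs"] if key in jobs}


def dangling_back_references(sources, jobs):
    """Job rowkeys whose "source" column names a row that no longer exists."""
    return {key for key, row in jobs.items()
            if "source" in row and row["source"] not in sources}


if __name__ == "__main__":
    for hybrid in (False, True):
        sources, jobs = make_tables(hybrid)
        interrupted_delete(sources, "src1")
        dead = set(jobs) - reachable_jobs(sources, jobs)
        wild = dangling_back_references(sources, jobs)
        print("hybrid" if hybrid else "option 2", "dead:", sorted(dead),
              "wild:", sorted(wild))
```

In both variants the jobs become unreachable "dead data". Only in the
hybrid variant do the job rows additionally carry a reference to a
missing source, which any code trusting that column would follow.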

> Now If I want to access all jobs for a particular source, I have the below
> mentioned options.
> 
> 1. Get all job rowkeys from source row, and then use them to fetch the jobs.
> This would be a direct getRows operation and as the rows in HBase are
> indexed by rowkey, I expect this to be fast. But in the process I make two
> calls to two separate tables.
> 
> 2.  As I also have source rowkey in each job row as a column, I can use a
> filter to get all jobs in a single scan of the table but this column is not
> indexed and hence the whole table will be scanned naively if I am correct. 
> 
> Now, in terms of response time, which of these methods will be faster.

The first, as you pointed out.

For a source with n jobs, you would have to make at most n+1 gets (one
for the source row, one per job). Plan 2 would need a lot more work,
because in the worst case you would have to iterate over every job in
the table (which could be millions, as you mentioned).

A "get" of a row is the fastest way to fetch data.

You can think of the source row as the "index" of its jobs, and an index
lookup is faster than a filter or a MapReduce job.
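
The n+1 argument can be sketched by counting row reads in a small
in-memory model. This is Python with plain dictionaries, not the HBase
client; build_tables, jobs_via_source and jobs_via_filter are
hypothetical names, and counting rows touched is only a rough proxy for
response time (a real client could batch the gets, and a filtered scan
runs on the server side).

```python
def build_tables(num_sources, jobs_per_source):
    """Create sources that list their job rowkeys, and jobs that name their source."""
    sources, jobs = {}, {}
    for s in range(num_sources):
        source_key = f"src{s}"
        job_keys = [f"{source_key}-job{j}" for j in range(jobs_per_source)]
        sources[source_key] = {"jobs": job_keys}
        for key in job_keys:
            jobs[key] = {"id": key, "source": source_key}
    return sources, jobs


def jobs_via_source(sources, jobs, source_key):
    """Method 1: one get on the source row, then one get per job rowkey."""
    reads = 1  # the get on the source row
    result = []
    for key in sources[source_key]["jobs"]:
        reads += 1  # one keyed lookup per job
        result.append(jobs[key])
    return result, reads


def jobs_via_filter(jobs, source_key):
    """Method 2: look at every job row and keep those whose column matches."""
    reads = 0
    result = []
    for row in jobs.values():
        reads += 1  # every row in the table is inspected
        if row["source"] == source_key:
            result.append(row)
    return result, reads


if __name__ == "__main__":
    sources, jobs = build_tables(num_sources=100, jobs_per_source=10)
    _, keyed_reads = jobs_via_source(sources, jobs, "src7")
    _, scan_reads = jobs_via_filter(jobs, "src7")
    print(keyed_reads, scan_reads)
```

With 100 sources of 10 jobs each, the first method touches 11 rows and
the second touches all 1000, and the gap grows with the size of the job
table rather than with the number of jobs per source.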

> Also, I didn't get the idea behind 'if at some point a reevaluation of a
> source has to be done, you could simply use a row lock to prevent race
> conditions'. An elaboration of this would be great!
Although I have an idea of what you are trying to do, I do not know all
the requirements. And as we all know, requirements change over time, so
an approach has to make future use cases as easy as possible. That is
why I tried to make some points on the advantages.

I could imagine a use case where you have to reevaluate all sources
(e.g. "no jobs from a specific sector anymore"). So you have to iterate
over all sources and reevaluate them one by one, kick the old jobs, add
new ones for the source etc. This would take some time for each
row/source, and as you don't want to take down the whole service, the
users would access the db during the reevaluation. This could produce
race conditions, which you can avoid by taking a row lock on the
specific row you are working on at the time.

Or your customer/boss or you yourself may decide that newly inserted
sources have to be deleted regularly (evil jobs from evil guys). If you
just make a list and delete it naively, you could produce race
conditions, which you can avoid by row locks, too.
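
The race I have in mind is a read-modify-write on one source row while
users keep writing to it. Below is a minimal sketch of the idea in
Python, using one threading.Lock per row in an in-memory table. It is a
model only: how (and whether) the HBase client exposes explicit row
locks depends on the version, and the class SourceTable with its methods
add_job and reevaluate is hypothetical.

```python
import threading
from collections import defaultdict


class SourceTable:
    """In-memory stand-in for the source table with one lock per row."""

    def __init__(self):
        self.rows = {}  # source rowkey -> list of job rowkeys
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock registry

    def _lock_for(self, source_key):
        with self._locks_guard:
            return self._locks[source_key]

    def add_job(self, source_key, job_key):
        """A user adds a job to a source while the service is running."""
        with self._lock_for(source_key):
            self.rows.setdefault(source_key, []).append(job_key)

    def reevaluate(self, source_key, keep):
        """Rewrite the job list of one source, keeping jobs where keep(job) is true."""
        with self._lock_for(source_key):
            current = list(self.rows.get(source_key, []))
            # Without the lock, a job appended between this read and the
            # write below would be silently lost.
            self.rows[source_key] = [job for job in current if keep(job)]
```

Because only the row being worked on is locked, writers to other sources
are not blocked while one source is reevaluated.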

It was just a small remark on something that was on my mind at the
moment I was writing the mail ;).

Best wishes

Wilm
