hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: Heterogeneous cluster
Date Wed, 12 Dec 2012 03:54:03 GMT

Hi Jean,
     Hope Harsh's reply made things clear. Thanks, Harsh.
Please always keep in mind the two layers: HBase, and under it the HDFS layer where the data
actually lies. When you read HBase tables via MR, the read happens through the regions, not
directly from the stored HFiles. So yes, if the task for region1 is running on N2 and region1
is on N1, there will be an RPC to N1, and the DFS client on N1 may in turn read the data from
N1. So even if the data is replicated on N2, data locality does not help you here.

HDFS-2246 introduced short-circuit reads. You can find a detailed explanation of
how and when they are useful at the path mentioned below.
Also, it may be better to configure the HBase-handled checksum option for better performance if
you are using 0.94.x version. [This will work only when the read is a short circuited local

From: Harsh J [harsh@cloudera.com]
Sent: Wednesday, December 12, 2012 1:50 AM
To: user@hbase.apache.org
Subject: Re: Heterogeneous cluster


On Wed, Dec 12, 2012 at 12:18 AM, Jean-Marc Spaggiari
<jean-marc@spaggiari.org> wrote:
> Hi Anoop,
> Thanks for the clarification.
> So let's take one example.
> Let's say I have 4 nodes and a replication factor set to 3.
> I have a region hosted on N1, replicated on N2 and N3. Nothing about
> this region on N4.

The important bit is, pending further enhancements along this line,
"regions" are not replicated. Region's data is replicated on HDFS, but
a Region itself is not replicated. It is served from a single point
(where it is currently assigned). Region data read requests are done
via the RegionServer layer, not directly from DataNodes (from a client

> It's time to run a MR, and someone needs to work on the given region.
> N1 is too busy, so the region will be given to another node. Does it mean
> it will be given randomly between N2, N3 and N4?

HBase jobs are submitted with each region's split location set to its
current assignee (at time of submission). This gives the "locality".

> If it's given to N4, it's missing an opportunity to get the data almost locally.

If your task gets assigned to any other node or if the region moves
after the job's begun, the data locality of the reads the regionserver
does may easily be affected, yes.

> Also, if the job is given to N2 or N3, are they going to remotely query
> the data over the network from N1? Or are they able to read it from
> the replica? Based on what you are saying, it seems that they will
> retrieve it from N1. Is there not another opportunity to improve the
> process by reading from the replicated data and not from the master
> one?

As explained above, all reads go through the assigned regionserver. So
the concept of HDFS block replicas can't be applied here yet (I do
know enhancements around this are planned).
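
To make the read path concrete, here is a toy model of it. This is only
a sketch: the class and method names are made up for illustration and
nothing in it is HBase or HDFS API. It just encodes the two rules above:
the task always talks to the assigned regionserver, and only that
regionserver's own node matters for HDFS locality.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model; not part of HBase or Hadoop.
public class ReadPathSketch {

    // The MR task never reads HFiles itself. It sends an RPC to the node
    // the region is assigned to, so the RPC is remote whenever the task
    // runs on a different node.
    public static boolean rpcIsRemote(String taskNode, String regionServerNode) {
        return !taskNode.equals(regionServerNode);
    }

    // The regionserver's DFS client does the HDFS read. It is local only
    // if the regionserver's own node holds a replica of the blocks;
    // replicas on the task's node play no part.
    public static boolean hdfsReadIsLocal(String regionServerNode, Set<String> replicaNodes) {
        return replicaNodes.contains(regionServerNode);
    }

    public static void main(String[] args) {
        // The scenario from this thread: region assigned to N1,
        // block replicas on N1, N2 and N3, nothing on N4.
        Set<String> replicas = new HashSet<>(Arrays.asList("N1", "N2", "N3"));
        String regionServer = "N1";
        for (String taskNode : new String[] {"N1", "N2", "N4"}) {
            System.out.println("task on " + taskNode
                    + ": remote RPC = " + rpcIsRemote(taskNode, regionServer)
                    + ", regionserver reads HDFS locally = "
                    + hdfsReadIsLocal(regionServer, replicas));
        }
    }
}
```

In this model a task on N2 and a task on N4 look identical: both pay for
a remote RPC to N1, and the replica sitting on N2 is never consulted.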

> When you are talking about "the short circuit read option", is this
> something we need to enable as a property? Or is it more like a piece
> of code?

It's configs, and the speed-drug details are at
http://hbase.apache.org/book.html#perf.hdfs section "11.10.2.
Leveraging local data".
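
As a rough sketch of what those configs look like with the
HDFS-2246-style short circuit on 0.94.x (the user value "hbase" is an
assumption, use whichever user starts your RegionServers, and do check
the property names against the book section for your exact version):

```xml
<!-- hdfs-site.xml, on the DataNodes -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <!-- assumption: the user that runs the HBase RegionServer -->
  <value>hbase</value>
</property>

<!-- hbase-site.xml, on the RegionServers -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- the HBase-handled checksum option Anoop mentions (0.94.x) -->
  <name>hbase.regionserver.checksum.verify</name>
  <value>true</value>
</property>
```

The DataNodes and RegionServers need a restart to pick these up.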

Harsh J