hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HDFS data locality
Date Tue, 17 Nov 2009 23:10:22 GMT
Flush files have good read locality as soon as they are written. As JD and Ryan say, once
the major compaction interval has elapsed, region store files generally have good read
locality as well. The interval is configurable, and you can also trigger a major compaction
manually via the shell or the HBaseAdmin client API. Meanwhile, HDFS replicates writes for
data durability. I think you want that.

    - Andy

From: Jean-Daniel Cryans <jdcryans@apache.org>
To: hbase-user@hadoop.apache.org
Sent: Tue, November 17, 2009 2:51:05 PM
Subject: Re: HDFS data locality

The master doesn't assign regions based on locality; we rely on the way
HDFS works. Also, it's almost impossible to assign regions based on
locality, as a region's files could be spread across different nodes, and
moving them around for the sake of locality would mean moving possibly
hundreds of GB...

So when you write a file to HDFS, you first write to the local
Datanode, then it's streamed to other DNs. If you have a pretty normal
production cluster that stays up 24/7, the regions won't move around,
so the new files created in the regions are on the same node. Also,
every 24 hours (by default) the major compaction thread rewrites all
store files into one (if needed) for each family and, again, you get
locality.
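
To make the mechanism above concrete, here is a toy Python model, not HBase
or HDFS code. The names write_file, locality and major_compact are
hypothetical, and replica placement is simplified: the first replica goes to
the writer's node and the others to random nodes, whereas real HDFS placement
is also rack-aware. It only illustrates why locality drops after a region
moves and comes back after a major compaction.

```python
import random


def write_file(writer, nodes, replication=3, rng=random):
    """Return the set of nodes holding replicas of one new file.

    Simplifying assumption: first replica on the writer's own node,
    remaining replicas on randomly chosen other nodes.
    """
    others = [n for n in nodes if n != writer]
    return {writer, *rng.sample(others, replication - 1)}


def locality(host, files):
    """Fraction of a region's store files that have a replica on `host`."""
    if not files:
        return 1.0
    return sum(host in replicas for replicas in files) / len(files)


def major_compact(host, nodes, rng=random):
    """Rewrite all store files into a single file written by `host`."""
    return [write_file(host, nodes, rng=rng)]


rng = random.Random(0)
nodes = [f"node{i}" for i in range(10)]

# A region served by node0 flushes five store files: all are local.
files = [write_file("node0", nodes, rng=rng) for _ in range(5)]
print("on node0 after flushes:", locality("node0", files))

# The region is reassigned to node9: locality now depends on where the
# extra replicas happened to land.
print("on node9 after move:", locality("node9", files))

# A major compaction on node9 rewrites the data locally again.
files = major_compact("node9", nodes, rng=rng)
print("on node9 after major compaction:", locality("node9", files))
```

The first and last figures are always 1.0 in this model, because the writer
always keeps one replica; the middle one varies with replica placement.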


On Tue, Nov 17, 2009 at 2:43 PM, Igor Katkov <ikatkov@gmail.com> wrote:
> Hi,
> When HMaster assigns regions to region servers, does it try to ensure that
> the regions' files will be located on the same host in HDFS? It does not,
> does it?
> So most likely HBase RegionServers are very chatty over the network, reading
> and writing from/to the HDFS daemons on other nodes.
> Is there a way to improve it? To make RegionServer mostly talk to the local
> DataNode only?
