hbase-user mailing list archives

From steven zhuang <steven.zhuang.1...@gmail.com>
Subject Re: get the impact hbase brings to HDFS, datanode log exploded after we started HBase.
Date Fri, 09 Apr 2010 09:24:01 GMT
hi, Stack,

                I checked one of the very large datanode logs and found a huge
number of HDFS_READ records in which the src and dest are the same node, the
node where the META table is hosted. The dates on which the log file is this
large are exactly the dates on which we imported data into the HBase tables.

The records look like the following:

2010-04-06 00:00:00,488 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.76.16.90:50010, dest: /10.76.16.90:34862, bytes: 132096, op: HDFS_READ,
cliID: DFSClient_-399218057, srvID:
DS-1883115035-10.76.16.90-50010-1241482560210, blockid:
blk_6744180467899014178_1371575
               The block ID varies, but some block IDs appear about 3 million
times in the daily log.

               So I think this huge log size is caused by the data upload:
when an update is committed by the HBase client, the row's region has to be
located by referring to the META table.

               My questions are: in my M/R upload program I used a BatchUpdate
object to emit 1000 cells at a time, and for one row there could be hundreds of
such commits (the loop is sketched below). Does the client do a location lookup
(i.e. a read of the META table) every time it commits something?
               And why are the reads from a node to itself?
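
For reference, the upload loop looks roughly like the following. This is a
simplified sketch against the 0.20-era client API; the "queries" family, the
method shape and the iterators are placeholders, not the real program:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;

    public class UploadSketch {
      // One batch of roughly 1000 cells for a single row; each commit() is
      // one call from the client against the table.
      static void commitBatch(HTable table, String rowKey,
                              Iterator<String> qualifiers,
                              Iterator<byte[]> values) throws IOException {
        BatchUpdate bu = new BatchUpdate(rowKey);        // one row per BatchUpdate
        for (int i = 0; i < 1000 && qualifiers.hasNext(); i++) {
          bu.put("queries:" + qualifiers.next(), values.next());  // "family:qualifier"
        }
        table.commit(bu);
      }
    }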


On Fri, Apr 9, 2010 at 2:08 PM, steven zhuang
<steven.zhuang.1984@gmail.com> wrote:

> hi, Stack,
> thanks. My concern is not the log size itself, but why there are so many
> records in it. If they are caused by HBase, then, since we don't have much
> data in HBase, these accesses cannot be efficient.
>
>
> On Fri, Apr 9, 2010 at 12:34 PM, Stack <stack@duboce.net> wrote:
>
>> On Thu, Apr 8, 2010 at 7:42 PM, steven zhuang
>> <steven.zhuang.1984@gmail.com> wrote:
>> > In this case there are lots of access records but rather less data than in
>> > our usual Hadoop jobs. Can we say that there are usually many more blocks
>> > involved in an HBase HDFS access than in a plain Hadoop HDFS access? That
>> > cannot be efficient.
>>
>>
>> For a random access, usually only one block is involved in hbase at
>> least.  If you first indexed your content in HDFS, it'd be about the
>> same.
>>
>> Generally we do scans on HBase.
>
>> > I know that sometimes there are small region store files, but if they are
>> > small, they will be merged into one by compaction, right?
>>
>> The aim is storefiles of about the size of a single block.  Usually I'd
>> say we spill over into the second block.
>>
>> On compaction, storefiles are merged and will tend to grow in size (adding
>> more blocks).
>>
>>
>> > Is there any way we can lower the number of small data accesses? Maybe by
>> > setting a higher row-caching number, but that is application dependent. Any
>> > other options we can use to lower this number?
>> >
>>
>> What are your reads like?  Lots of random reads?  Or are they all
>> scans?  Do they adhere to any kind of pattern or are they random?
>>
> Most of the reads are scans; we do sequential reads on successive rows most
> of the time, and we designed our schema that way.
> We do data imports too, although not very frequently. I am sure this causes
> a lot of HDFS log entries too.
>
>> Yes, you could up your cache size too.
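
For the scan-heavy reads, upping the cache size would look something like the
following with the 0.20 client API (a rough sketch; the table handle and the
"queries" family name are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSketch {
      static void scanWithCaching(HTable table) throws IOException {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("queries"));  // placeholder family name
        scan.setCaching(500);                      // rows fetched per RPC; the default is 1
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process the row here
          }
        } finally {
          scanner.close();
        }
      }
    }

The same thing can be set cluster-wide with hbase.client.scanner.caching in
hbase-site.xml.
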
>>
>> What is the problem you are trying to address?  Are you saying all the
>> i/o is killing your HDFS or something?  Or is it just the big logs that
>> you are trying to address?
>>
>
> My concern is actually the I/O performance, since we are new to HBase.
> Before we had HBase, the daily log size was a few dozen MB at most; now we
> see logs of hundreds of MB. So we are wondering whether HBase contributes
> most of the log records, whether the number of HBase reads/writes might be
> too high, and whether there is any way we can improve the performance and
> lower the number of reads/writes.
>
>
>> >> You could turn them off explicitly in your log4j.  That should help.
>> >>
>> >> Don't run DEBUG level in datanode logs.
>> >>
>> >>
>> > we are running the cluster at INFO level.
>> >
>>
>> Do you see the clienttrace loggings?  You could explicitly disable
>> this class's loggings.  That should make a big difference in log size.
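
Concretely, disabling that logger would mean adding a line like the following
to conf/log4j.properties on the datanodes (assuming the stock Hadoop log4j
setup; the logger name is taken from the clienttrace record quoted above):

    log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN
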
>>
>
> St.Ack
>>
>> >
>> >> Other answers inlined below.
>> >>
>> >> On Thu, Apr 8, 2010 at 2:51 AM, steven zhuang
>> >> <steven.zhuang.1984@gmail.com> wrote:
>> >> >...
>> >> > At present, my idea is to calculate the data I/O volume of both HDFS
>> >> > and HBase for a given day, and with the result we can have a rough
>> >> > estimate of the situation.
>> >>
>> >> Can you use the above noted clienttrace logs to do this?  Are the clients
>> >> on different hosts -- i.e. the hdfs clients and hbase clients?  If so
>> >> that'd make it easy enough.  Otherwise, it'd be a little difficult.
>> >> There is probably an easier way, but one (awkward) means of calculating
>> >> would be writing a mapreduce job that took the clienttrace messages and
>> >> all blocks in the filesystem and then had it sort out the clienttrace
>> >> messages that belong to the ${HBASE_ROOTDIR} subdirectory.
>> >>
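
A rough sketch of the map side of such a job is below (old 0.20 mapred API;
the regex and class names are made up, and the join against block-to-path
info, e.g. from "hadoop fsck / -files -blocks" output, would happen in a later
step that is left out here):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (block id, bytes) for every clienttrace line; a later step would
    // attribute those bytes to paths under ${HBASE_ROOTDIR} vs. everything else.
    public class ClientTraceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final Pattern TRACE = Pattern.compile(
          "bytes: (\\d+), op: (HDFS_READ|HDFS_WRITE).*blockid: (blk_-?\\d+)");

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        Matcher m = TRACE.matcher(line.toString());
        if (m.find()) {
          out.collect(new Text(m.group(3)),                     // block id
              new LongWritable(Long.parseLong(m.group(1))));    // bytes moved
        }
      }
    }
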
>> > Yeah, the hbase regionserver and the datanode are on the same host, so I
>> > cannot get the data read/written by HBase just from the datanode log.
>> > The Map/Reduce way may have a problem: we cannot get the historical block
>> > info from the HDFS file system. I mean, lots of blocks have already been
>> > garbage collected because we import or delete data.
>> >
>> >> > One problem I have now is deciding, from the regionserver log, the
>> >> > quantity of data read/written by HBase. Should I count the lengths in
>> >> > the following log records as the lengths of data read/written?:
>> >> >
>> >> > org.apache.hadoop.hbase.regionserver.Store: loaded
>> >> > /user/ccenterq/hbase/hbt2table2/165204266/queries/1091785486701083780,
>> >> > isReference=false, sequence id=1526201715, length=*72426373*,
>> >> > majorCompaction=true
>> >> >
>> >> > 2010-03-04 01:11:54,262 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> >> > Started memstore flush for region table_word_in_doc, resort
>> >> > all-2010/01/01,1267629092479. Current region memstore size *40.5m*
>> >> >
>> >> > Here I am not sure whether the *72426373* / *40.5m* is the length (in
>> >> > bytes) of the data read by HBase.
>> >>
>> >> That's just file size.  Above we opened a storefile and we just logged
>> >> its size.
>> >>
>> >> We don't log how much we've read/written anywhere in the hbase logs.
>> >>
>> >> St.Ack
>> >>
>> >
>>
>
>
