hbase-user mailing list archives

From steven zhuang <steven.zhuang.1...@gmail.com>
Subject Re: gauging the impact HBase brings to HDFS; datanode log exploded after we started HBase
Date Fri, 09 Apr 2010 06:08:09 GMT
hi, Stack,
             thanks, my concern is not the log size, but why there are so
many records in it. If they are caused by HBase, then since we don't have
much data in HBase, these accesses cannot be efficient.


On Fri, Apr 9, 2010 at 12:34 PM, Stack <stack@duboce.net> wrote:

> On Thu, Apr 8, 2010 at 7:42 PM, steven zhuang
> <steven.zhuang.1984@gmail.com> wrote:
> >         In this case, there are lots of access records but much less
> > data than in usual Hadoop jobs. Can we say there are usually many more
> > blocks involved in an HBase HDFS access than in a Hadoop HDFS access?
> > That cannot be efficient.
>
>
> For a random access, usually only one block is involved in hbase, at
> least.  If you first indexed your content in HDFS, it'd be about the
> same.
>
Generally, we do scans on HBase.
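
(As an aside: a single-row random read in the 0.20-era client API looks like
the sketch below. The row key is made up; the table and column family names
are borrowed from the log lines later in this thread.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomReadExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "table_word_in_doc");
        // A point Get: the random-access case described above, which
        // typically touches a single HDFS block per storefile consulted.
        Get get = new Get(Bytes.toBytes("some-row-key"));
        get.addFamily(Bytes.toBytes("queries"));
        Result result = table.get(get);
        System.out.println(result.isEmpty() ? "no such row" : result.toString());
      }
    }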

> >  I know sometimes there are small region store files, but if they are
> > small, they would be merged into one by compaction, right?
>
> The aim is storefiles of about the size of a single block. Usually I'd
> say we spill over into the second block.
>
> On compaction, storefiles are merged and will tend to grow in size (adding
> more blocks).
>
>
> >       Is there any way we can lower the number of small data accesses?
> > Maybe by setting a higher row-caching number, but that would be
> > application dependent. Any other options we can use to lower this number?
> >
>
> What are your reads like?  Lots of random reads?  Or are they all
> scans?  Do they adhere to any kind of pattern or are they random?
>
Most of the reads are scans: we do sequential reads on successive rows most
of the time; we designed our schema that way.
We do data imports too, although not very frequently; I am sure this causes
a lot of HDFS log traffic as well.

> Yes, you could up your cache size too.
>
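
(To make the caching suggestion concrete: a minimal sketch against the
0.20-era client API, raising the number of rows fetched per scanner round
trip. The 500 is only an illustrative value, and the table and family names
are borrowed from the log lines later in this thread.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanCachingExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "table_word_in_doc");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("queries"));
        // Ship 500 rows per RPC instead of the default: fewer, larger
        // round trips for sequential reads.
        scan.setCaching(500);
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process row ...
          }
        } finally {
          scanner.close();
        }
      }
    }

(The same knob can also be set cluster-wide via hbase.client.scanner.caching
in hbase-site.xml.)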
> What is the problem you are trying to address?  Are you saying all the
> i/o is killing your HDFS or something?  Or is it just making big logs
> that you are trying to address?
>

My concern is actually the I/O performance, since we are new to HBase.
Before we had HBase, the daily log size was dozens of MB at most; now we
see logs of hundreds of MB. So we are curious whether HBase contributes most
of the log records, whether the number of HBase reads/writes might be too
high, and whether there is any way we can improve performance and reduce the
number of reads/writes.


> >> You could turn them off explicitly in your log4j.  That should help.
> >>
> >> Don't run DEBUG level in datanode logs.
> >>
> >>
> > we are running the cluster at INFO level.
> >
>
> Do you see the clienttrace loggings?  You could explicitly disable
> this class's loggings.  That should make a big difference in log size.
>
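
(For reference: the clienttrace lines come from their own logger, so they
can be silenced without losing the rest of the datanode's INFO output. A
minimal log4j.properties entry, assuming the stock 0.20-era logger name;
check a clienttrace line in your own datanode log for the exact name.)

    # Silence only the per-read/per-write clienttrace lines in the datanode log.
    log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN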

> St.Ack
>
> >
> >> Other answers inlined below.
> >>
> >> On Thu, Apr 8, 2010 at 2:51 AM, steven zhuang
> >> <steven.zhuang.1984@gmail.com> wrote:
> >> >...
> >> >        At present, my idea is to calculate the data I/O volume of
> >> > both HDFS and HBase for a given day; with the result we can have a
> >> > rough estimate of the situation.
> >>
> >> Can you use the above noted clienttrace logs to do this?  Are clients
> >> on different hosts -- i.e. the hdfs clients and hbase clients?  If so
> >> that'd make it easy enough.  Otherwise, it'd be a little difficult.
> >> There is probably an easier way, but one (awkward) means of calculating
> >> would be by writing a mapreduce job that took the clienttrace messages
> >> and all blocks in the filesystem and then had it sort out the
> >> clienttrace messages that belong to the ${HBASE_ROOTDIR} subdirectory.
> >>
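
(A much cruder starting point than the mapreduce job sketched above: simply
sum the byte counts in the clienttrace lines by operation type. A minimal
sketch, assuming the 0.20-era field layout "bytes: N, op: HDFS_READ" /
"op: HDFS_WRITE"; it only totals the traffic, it does not attribute it to
${HBASE_ROOTDIR} files.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ClientTraceTally {
      // Matches the "bytes:" and "op:" fields of a clienttrace log line.
      private static final Pattern LINE =
          Pattern.compile("bytes: (\\d+), op: (HDFS_READ|HDFS_WRITE)");

      public static void main(String[] args) throws Exception {
        Map<String, Long> totals = new HashMap<String, Long>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          Matcher m = LINE.matcher(line);
          if (m.find()) {
            Long sum = totals.get(m.group(2));
            totals.put(m.group(2),
                (sum == null ? 0L : sum) + Long.parseLong(m.group(1)));
          }
        }
        in.close();
        System.out.println(totals);  // e.g. {HDFS_READ=..., HDFS_WRITE=...}
      }
    }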
> > Yeah, the hbase regionserver and datanode are on the same host, so I
> > cannot get the data read/written by HBase just from the datanode log.
> > The map/reduce way may have a problem: we cannot get the historical
> > block info from the HDFS filesystem; I mean, lots of blocks get garbage
> > collected as we import or delete data.
> >
> >> >        One problem I've met now is deciding, from the regionserver
> >> > log, the quantity of data read/written by HBase. Should I count the
> >> > lengths in the following log records as the lengths of data
> >> > read/written?
> >> >
> >> > org.apache.hadoop.hbase.regionserver.Store: loaded
> >> > /user/ccenterq/hbase/hbt2table2/165204266/queries/1091785486701083780,
> >> > isReference=false, sequence id=1526201715, length=*72426373*,
> >> > majorCompaction=true
> >> > 2010-03-04 01:11:54,262 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
> >> > for region table_word_in_doc, resort all-2010/01/01,1267629092479.
> >> > Current region memstore size *40.5m*
> >> >
> >> >        Here I am not sure whether the *72426373*/*40.5m* is the
> >> > length (in bytes) of data read by HBase.
> >>
> >> That's just file size.  Above we opened a storefile and we just
> >> logged its size.
> >>
> >> We don't log how much we've read/written anywhere in hbase logs.
> >>
> >> St.Ack
> >>
> >
>
