hbase-user mailing list archives

From Jun Rao <jun...@almaden.ibm.com>
Subject Re: Multi get/put
Date Thu, 07 Aug 2008 00:42:46 GMT

stack <stack@duboce.net> wrote on 08/06/2008 05:32:09 PM:

> Jun Rao wrote:
> > In terms of performance, the biggest overhead comes from Hbase/Hadoop ipc.
> > For simple queries, a search through ipc takes 3-4 times as long as that
> > directly on HDFS. I guess a lot of the overhead is because of java
> > reflection in ipc proxy. Does Hbase have plans to make ipc more efficient?
> >
> We do.  It's a priority.  0.3.0 hopefully.
>
> > HDFS adds another layer of overhead compared with local file system. A
> > search on HDFS (on a node that has a local copy of all data) can take 10
> > times as long as that on local file system. We suspect most overhead comes
> > from reopening sockets in HDFS client.
> >
> Are you on a recent hbase, Jun?  Hadoop RPC seems to be reusing
> connections in 0.17.1.  Maybe that will help.
>

Our tests were done on Hadoop 0.17.1.


> St.Ack
>
>
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
> > (408)927-1886 (phone)
> > (408)927-3215 (fax)
> >
> >
> > From: stack <stack@duboce.net>
> > Date: 08/06/2008 01:42 PM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Multi get/put
> > Please respond to: hbase-user@hadoop.apache.org
> >
> > Ning Li wrote:
> >
> >>> Do you have to do a rewrite of the lucene index at compaction time?  Or
> >>> just call optimize?  (I suppose it's the former if you need to clean up
> >>> 'References' as per below where you talk of splits)
> >>>
> >> What do you mean by "a rewrite of the lucene index"?
> >>
> >
> > In hbase, on split, daughters hold a reference to either the top or
> > bottom half of their parent region.  References are undone by
> > compactions; as part of compaction, the part of the parent referenced by
> > the daughter gets written out to store files under the daughter.
> > Daughters try to undo references as promptly as possible because regions
> > with references are not splittable (references to references, and so on,
> > would soon become unmanageable).
> >
> > In your description, you mentioned that daughter regions reference their
> > parents' index.  When I said, 'a rewrite of the lucene index', I was
> > asking, as per hbase regions, if you followed the model and wrote a new
> > lucene index comprised of daughter-only content at compaction time.  Or
> > do you just 'optimize' and let the references build up so the daughter
> > of a daughter points all the way up to the parent?
> >
> > Just wondering.
> >
> >
> >
> >>> Regarding your 'on the other hand' above, that's a good point.  Have you
> >>> verified that if a regionserver is running on a datanode, the lucene
> >>> index is written locally?  Would be interesting to know.
> >>>
> >>>
> >> That's HDFS's policy. See HDFS's FSNamesystem.getAdditionalBlock.
> >>
> >>
> > Sorry.  Yeah, of course.
> >
> > So, why do you think it is so slow going via HDFS FileSystem when the data
> > is local?  Is it the block-orientated access or is there just a high tax
> > going via the HDFS FS interface?
> >
> > St.Ack
> >
> >
> >
>
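
For readers following the quoted description of split references, the idea can be sketched as a toy model. This is not HBase code: the `Region` and `Reference` classes, the dict standing in for store files, and the convention that "bottom" means keys below the split key are all assumptions made for illustration. It only shows the three behaviours described above: a daughter reads through a reference to half of its parent, a region holding a reference cannot be split, and compaction writes the referenced rows into the daughter's own store and drops the reference.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Reference:
    """Toy pointer from a daughter to one half of its parent's rows."""
    parent: "Region"
    half: str        # assumed convention: "bottom" = keys < split_key, "top" = keys >= split_key
    split_key: str

    def rows(self) -> Dict[str, str]:
        source = self.parent.rows()
        if self.half == "bottom":
            return {k: v for k, v in source.items() if k < self.split_key}
        return {k: v for k, v in source.items() if k >= self.split_key}


@dataclass
class Region:
    name: str
    store: Dict[str, str] = field(default_factory=dict)  # stands in for the region's own store files
    reference: Optional[Reference] = None

    def rows(self) -> Dict[str, str]:
        # Read through the reference (if any), then overlay the region's own data.
        merged = dict(self.reference.rows()) if self.reference else {}
        merged.update(self.store)
        return merged

    def splittable(self) -> bool:
        # A region that still holds a reference may not split again.
        return self.reference is None

    def split(self, split_key: str) -> Tuple["Region", "Region"]:
        if not self.splittable():
            raise RuntimeError(f"{self.name} still holds a reference; compact first")
        bottom = Region(self.name + "-bottom", reference=Reference(self, "bottom", split_key))
        top = Region(self.name + "-top", reference=Reference(self, "top", split_key))
        return bottom, top

    def compact(self) -> None:
        # Write the referenced rows into this region's own store, then drop the reference.
        self.store = self.rows()
        self.reference = None


parent = Region("r1", {"a": "1", "b": "2", "c": "3", "d": "4"})
bottom, top = parent.split("c")
top.compact()            # top now owns its rows and can split again
print(top.splittable(), bottom.splittable())
```

In this sketch a daughter that skipped compaction would need a reference to a reference in order to split, which is the chain the quoted text says hbase avoids by refusing to split such regions.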

