hbase-user mailing list archives

From Nick Xie <nick.xie.had...@gmail.com>
Subject Re: HBase 6x bigger than raw data
Date Mon, 27 Jan 2014 23:02:29 GMT
Thanks, all, for the information. Much appreciated! I'll take a look and give it a try.

Thanks,

Nick




On Mon, Jan 27, 2014 at 2:43 PM, Vladimir Rodionov
<vrodionov@carrieriq.com> wrote:

> The overhead of storing small values is quite high in HBase unless you use
> DATA_BLOCK_ENCODING (not available in 0.92). I recommend moving to the
> latest 0.94 release.
>
> See: https://issues.apache.org/jira/browse/HBASE-4218
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Nick Xie [nick.xie.hadoop@gmail.com]
> Sent: Monday, January 27, 2014 2:40 PM
> To: user@hbase.apache.org
> Subject: Re: HBase 6x bigger than raw data
>
> Tom,
>
> Yes, you are right. According to this analysis
> (http://prafull-blog.blogspot.in/2012/06/how-to-calculate-record-size-of-hbase.html),
> if it is correct, the overhead is quite large when the cell value makes up
> only a small portion of the stored record.
>
> In the analysis at that link, the overhead is actually about 10x (the real
> values take only 12 B, yet storing them in HBase costs 123 B). Is that
> right?
>
> In that case, should we combine several values into one cell to reduce the
> overhead?
>
> Thanks,
>
> Nick
>
>
>
>
> On Mon, Jan 27, 2014 at 2:33 PM, Tom Brown <tombrown52@gmail.com> wrote:
>
> > I believe each cell stores its own copy of the entire row key, column
> > qualifier, and timestamp. Could that account for the increase in size?
> >
> > --Tom
> >
> >
> > On Mon, Jan 27, 2014 at 3:12 PM, Nick Xie <nick.xie.hadoop@gmail.com>
> > wrote:
> >
> > > I'm importing a set of data into HBase. The CSV file contains 82 entries
> > > per line: an 8-byte ID, followed by a 16-byte date, and then 80 numbers
> > > of 4 bytes each.
> > >
> > > The current HBase schema is: the ID as the row key, the date in a 'date'
> > > family with a 'value' qualifier, and the rest in another family called
> > > 'readings' with 'P0', 'P1', 'P2', ... through 'P79' as qualifiers.
> > >
> > > I'm testing this on a single-node cluster with HBase running in
> > > pseudo-distributed mode (no replication, no compression for HBase). After
> > > importing a 150 MB CSV file stored in HDFS (no replication), I checked
> > > the table size, and it shows ~900 MB, which is 6x larger than the file
> > > in HDFS.
> > >
> > > Why is there such a large overhead? Am I doing anything wrong here?
> > >
> > > Thanks,
> > >
> > > Nick
> > >
> >
>
>
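
The per-cell overhead discussed in this thread can be estimated with a little arithmetic. The sketch below is only a rough model, not a measurement: it assumes the KeyValue layout used by HBase 0.92/0.94 (key length, value length, row length, family length, timestamp, and key type stored with every cell), assumes the ID, date, and readings are stored as raw bytes of the sizes given in the original question, and ignores HFile block and index overhead. The single packed qualifier 'P' is a hypothetical name used only to illustrate the "combine values" idea.

```python
# Fixed bytes stored with every cell in the KeyValue layout:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + key type (1).
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key_len, family, qualifier, value_len):
    """Approximate on-disk size in bytes of one uncompressed, unencoded cell."""
    return FIXED_OVERHEAD + row_key_len + len(family) + len(qualifier) + value_len


ROW_KEY_LEN = 8            # 8-byte ID used as the row key (assumed raw bytes)
RAW_ROW = 8 + 16 + 80 * 4  # ID + date + 80 readings of 4 bytes each

# Schema from the thread: one 'date:value' cell plus 80 'readings:P<i>' cells.
date_cell = cell_size(ROW_KEY_LEN, "date", "value", 16)
reading_cells = sum(
    cell_size(ROW_KEY_LEN, "readings", f"P{i}", 4) for i in range(80)
)
per_row = date_cell + reading_cells

# Hypothetical alternative: pack all 80 readings into one 320-byte cell.
packed_row = date_cell + cell_size(ROW_KEY_LEN, "readings", "P", 80 * 4)

print(f"raw bytes per row:          {RAW_ROW}")
print(f"one cell per reading:       {per_row} ({per_row / RAW_ROW:.1f}x)")
print(f"readings packed in one cell: {packed_row} ({packed_row / RAW_ROW:.1f}x)")
```

Under these assumptions the row key, family name, and fixed fields are repeated 81 times per row, which dominates the 4-byte values. Packing the readings trades that overhead for the loss of per-reading column access, so whether it fits depends on how the data is queried.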
