hbase-user mailing list archives

From Rob Verkuylen <...@verkuylen.net>
Subject Re: Explosion in datasize using HBase as a MR sink
Date Tue, 04 Jun 2013 19:58:36 GMT
Finally fixed this, my code was at fault.

Protobufs require a builder object, which in our case was a (non-static) protected field in an abstract
class that all parsers extend. The mapper calls a parser factory depending on the input record.
Because we designed the parser instances as singletons, the builder object in the abstract
class got reused and all data got appended to the same builder. Doh! This only shows up in
a full job, not in single-record tests. Ah well, I've learned a lot :)
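For anyone hitting the same symptom, a minimal sketch of the bug pattern (RecordProto and the
parser class are hypothetical stand-ins, not our actual code): a builder held as instance state
in a shared parser keeps accumulating repeated fields across records, while a fresh builder per
record does not.

// Hypothetical stand-ins: RecordProto is a generated protobuf message with a
// repeated field "item"; the parser instance is shared as a singleton.
abstract class AbstractParser {

    // BUG: one builder per singleton parser instance, so every record the
    // mapper pushes through it appends to the same repeated field.
    protected final RecordProto.Builder shared = RecordProto.newBuilder();

    RecordProto parseBuggy(String xml) {
        shared.addItem(extract(xml));   // keeps growing across records
        return shared.build();          // record N also carries records 1..N-1
    }

    // FIX: build from a fresh builder per record (or clear() the shared one first).
    RecordProto parseFixed(String xml) {
        return RecordProto.newBuilder()
                .addItem(extract(xml))
                .build();
    }

    abstract String extract(String xml);
}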

@Asaf we will be moving to LoadIncrementalHFiles asap. I had the code ready, but obviously
it showed the same size problems before the fix.
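For reference, a rough sketch of the bulk-load path (mapper, table name and paths are
placeholders, not our real job): HFileOutputFormat.configureIncrementalLoad wires up the
partitioner and sort reducer so the job writes region-aligned HFiles, and LoadIncrementalHFiles
moves them into the table afterwards, bypassing the memstore/flush/WAL path entirely.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "parse-to-hfiles");           // placeholder job name
        job.setJarByClass(BulkLoadSketch.class);
        // job.setMapperClass(ParserMapper.class);            // the real parser mapper would go here,
        //                                                     // emitting (ImmutableBytesWritable, Put)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        HTable table = new HTable(conf, "T2.1");               // placeholder table name
        // Sets the output format, total-order partitioner and sort reducer so
        // the generated HFiles line up with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
            // Move the finished HFiles into the regions; no memstore, flush or WAL.
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
        }
    }
}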

Thnx for the thoughts!

On May 31, 2013, at 22:02, Asaf Mesika <asaf.mesika@gmail.com> wrote:

> On your data set size, I would go with HFileOutputFormat and then bulk load into HBase.
> Why go through the Put flow anyway (memstore, flush, WAL), especially if you have the input
> ready at your disposal for a re-try if something fails?
> Sounds faster to me anyway.
> On May 30, 2013, at 10:52 PM, Rob Verkuylen <rob@verkuylen.net> wrote:
>> On May 30, 2013, at 4:51, Stack <stack@duboce.net> wrote:
>>> Triggering a major compaction does not alter the overall 217.5GB size?
>> A major compaction reduces the size from the original 219GB to 217.5GB, so barely a reduction.
>> 80% of the region sizes are 1.4GB before and after. I haven't merged the smaller regions,
>> but that still would not bring the size down to the 2.5-5 or so GB I would expect given T2's size.
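(For reference, a minimal sketch of triggering that compaction from Java; the table name is a
placeholder, and the call is asynchronous, so the on-disk size only changes once the compaction
actually completes.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MajorCompactSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder table name; majorCompact() only queues the request,
        // so check region sizes again after the compaction has finished.
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        try {
            admin.majorCompact("T2.1");
        } finally {
            admin.close();
        }
    }
}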
>>> You have speculative execution turned on in your MR job so its possible you
>>> write many versions?
>> I've turned off speculative execution (through conf.set) just for the mappers; since
>> we're not using reducers, should we turn it off for those too?
>> I will triple-check the actual job settings in the job tracker, since I need to set
>> these at the job level.
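(A minimal sketch of the job-level setting, assuming the new-API Job object; the reduce-side
flag is moot for a map-only job, but disabling it too is harmless.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "parse-to-hbase");          // placeholder job name
        // Never launch a second attempt of a slow task: with a table as the sink,
        // duplicate attempts mean duplicate Puts for the same rows.
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);           // map-only job, so this is moot
        // ... remaining job setup ...
    }
}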
>>> Does your MR job fail many tasks (and though it fails, until it fails, it
>>> will have written some subset of the task hence bloating your versions?).
>> We've had problems with failing mappers because of ZooKeeper timeouts on large inserts;
>> we increased the ZooKeeper timeout and blockingStoreFiles to accommodate, and now we don't
>> get failures. This job writes to a cleanly created table with versions set to 1, so there
>> shouldn't be extra versions, I assume(?).
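(For reference, a minimal sketch of creating such a table with a single version per cell;
table and family names are placeholders. Note that with MaxVersions=1 duplicate writes still
sit in the store files until compaction discards the older versions, so they can inflate the
size temporarily.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor table = new HTableDescriptor("T2.1");   // placeholder table name
        HColumnDescriptor family = new HColumnDescriptor("d");   // placeholder family name
        // Keep a single version per cell; older versions written by duplicate
        // or retried tasks are discarded when the store files are compacted.
        family.setMaxVersions(1);
        table.addFamily(family);
        admin.createTable(table);
        admin.close();
    }
}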
>>> You are putting everything into protobufs?  Could that be bloating your
>>> data?  Can you take a smaller subset and dump to the log a string version
>>> of the pb?  Use TextFormat:
>>> https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat#shortDebugString(com.google.protobuf.MessageOrBuilder)
>> The protobufs reduce the size to roughly 40% of the original XML data in T1.
>> The MR parser is a port of the Python parse code we use going from T1 to T2.
>> I've done manual comparisons on 20-30 records from T2.1 and T2 and they match,
>> with only minute differences because of slightly different parsing. I've done these
>> in the HBase shell; I will try log dumping them too.
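(A minimal sketch of that dump, assuming a hypothetical generated message class RecordProto
and placeholder table/column names; it fetches one row, parses the stored bytes back into the
message and prints the compact text form for side-by-side comparison.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.protobuf.TextFormat;

public class DumpProtoSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: table "T2.1", family "d", qualifier "pb", and the
        // hypothetical generated message class RecordProto.
        HTable table = new HTable(HBaseConfiguration.create(), "T2.1");
        Result row = table.get(new Get(Bytes.toBytes(args[0])));
        byte[] cell = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("pb"));
        RecordProto record = RecordProto.parseFrom(cell);
        // Compact single-line text rendering of the message, handy for diffing records.
        System.out.println(TextFormat.shortDebugString(record));
        table.close();
    }
}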
>>> It can be informative looking at hfile content.  It could give you a clue
>>> as to the bloat.  See http://hbase.apache.org/book.html#hfile_tool
>> I will give this a go and report back. Any other debugging suggestions are more than
>> welcome :)
>> Thnx, Rob
