hbase-user mailing list archives

From Harry Waye <hw...@arachnys.com>
Subject Re: Optimizing bulk load performance
Date Thu, 24 Oct 2013 15:16:10 GMT

I took a snapshot on the initial run, before the changes:

Good timing: the disks appear to be failing (ATA errors) at the moment, so I'm
decommissioning and reprovisioning with new disks.  I'll be reprovisioning
without RAID (it's software RAID, just to compound the issue), although I'm
not sure how I'll go about migrating all the nodes.  I guess I'd need to put
more correctly specced nodes in the rack and decommission the existing ones.

We're using Hetzner at the moment, which may not have been a good choice.
 Has anyone had any experience with them w.r.t. Hadoop?  They offer 7 and 15
disk options, but are low on the CPU front (quad core).  Our workload will,
I assume, be on the heavy side.  There's also an 8 disk Dell PowerEdge that
is a little more powerful.  What hosting providers would people recommend?
 (And what would be the strategy for migrating?)

Anyhow, when I have things more stable I'll have a look at checking out
what's using the CPU.  In the meantime, can anything be gleaned from the
above snap?
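As a quick way to tell whether that load of 6 is CPU time or I/O wait, a one-liner over vmstat output like the following could help. This is a sketch, not from the thread: the field index `$16` assumes procps vmstat's default column layout (`us sy id wa st` as the last five columns), so verify against the header row on your system.

```shell
# Sample the system once per second, 5 samples, and average the "wa"
# (I/O wait) column.  NR > 2 skips vmstat's two header rows; $16 is
# the "wa" field in the default procps layout.
vmstat 1 5 | awk 'NR > 2 { wa += $16; n++ } END { if (n) printf "avg wa: %.1f%%\n", wa / n }'
```

A sustained high "wa" with modest "us"/"sy" would point at the disks rather than the reducers' CPU work.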


On 24 October 2013 15:14, Jean-Marc Spaggiari <jean-marc@spaggiari.org>wrote:

> Hi Harry,
> Do you have more details on the exact load? Can you run vmstat and see
> what kind of load it is? Is it user? CPU? I/O wait (wio)?
> I suspect your disks to be the issue. There are two things here.
> First, we don't recommend RAID for the HDFS/HBase disks. The best is to
> simply mount the disks on two mount points and give them both to HDFS.
> Second, 2 disks per node is very low. Even on a dev cluster that's not
> recommended. In production, you should go with 12 or more.
> So with only 2 disks in RAID, I suspect your WIO is high, which is what
> might be slowing your process.
> Can you take a look in that direction? If it's not that, we will continue
> to investigate ;)
> Thanks,
> JM
> 2013/10/23 Harry Waye <hwaye@arachnys.com>
> > I'm trying to load data into HBase using HFileOutputFormat and
> > incremental bulk load, but am getting rather lackluster performance: 10h
> > for ~0.5TB of data, ~50000 blocks.  This is being loaded into a table
> > that has 2 families, 9 columns, 2500 regions and is ~10TB in size.  Keys
> > are md5 hashes and regions are pretty evenly spread.  The majority of
> > time appears to be spent in the reduce phase, with the map phase
> > completing very quickly.  The network doesn't appear to be saturated,
> > but the load is consistently at 6, which is the number of reduce tasks
> > per node.
> >
> > 12 hosts (6 cores, 2 disks as RAID0, 1GbE, no one else on the rack).
> >
> > MR conf: 6 mappers, 6 reducers per node.
> >
> > I spoke to someone on IRC and they recommended reducing job output
> > replication to 1, and reducing the number of mappers, which I reduced
> > to 2.  Reducing replication appeared not to make any difference;
> > reducing reducers appeared just to slow the job down.  I'm going to
> > have a look at running the benchmarks mentioned on Michael Noll's blog
> > and see what that turns up.  I guess some questions I have are:
> >
> > How does the global number/size of blocks affect perf.?  (I have a lot
> > of 10MB files, which are the input files.)
> >
> > How does the job local number/size of input blocks affect perf.?
> >
> > What is actually happening in the reduce phase that requires so much CPU?
> >  I assume the actual construction of HFiles isn't intensive.
> >
> > Ultimately, how can I improve performance?
> > Thanks
> >
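JM's no-RAID advice above amounts to listing each disk's mount point separately in the DataNode configuration instead of striping them, so HDFS round-robins block writes across the disks itself. A minimal sketch, with placeholder mount paths; note the property is named `dfs.data.dir` on Hadoop 1.x and `dfs.datanode.data.dir` on Hadoop 2.x:

```xml
<!-- hdfs-site.xml: one entry per physical disk (JBOD), no RAID0.
     /mnt/disk1 and /mnt/disk2 are hypothetical mount points. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data</value>
</property>
```

With this layout a single failing disk takes out only its own blocks (which HDFS re-replicates), rather than the whole RAID0 volume.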

Harry Waye, Co-founder/CTO
+44 7890 734289

Follow us on Twitter: @arachnys <https://twitter.com/#!/arachnys>

Arachnys Information Services Limited is a company registered in England &
Wales. Company number: 7269723. Registered office: 40 Clarendon St,
Cambridge, CB1 1JX.
