hbase-user mailing list archives

From: Mat Hofschen <hofsc...@gmail.com>
Subject: Re: Import into empty table
Date: Thu, 12 Mar 2009 15:05:36 GMT
Hi Jonathan,
yes, we do run DataNodes and RegionServers on the same machines. The import
runs as a MapReduce job.

I did some monitoring (collectd) on the machine that hosts the first region
of the table. This machine is fully loaded, with the CPU maxed out. It takes
about 10 minutes before the region splits for the first time, and after the
split the second region mostly seems to be placed on the same machine. When
the import is finished there are about 18 regions distributed over 8
RegionServers. After that point imports are much faster.
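
On the key-ordering point from earlier in the thread: if we pre-split the
table, I could randomize the insert order by routing each row through the
MapReduce shuffle under a hash of its row key, so rows reach the reduces in
effectively random key order. A rough sketch against the 0.19 mapred API
(the mapper class and the CSV layout are only illustrative):

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits each CSV line under the MD5 of its row key, so the shuffle
    // sort hands rows to the reduces in effectively random key order.
    // The reduce side still writes with the original row key from the line.
    public class RandomizingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text line,
          OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String rowKey = line.toString().split(",", 2)[0]; // first CSV field
        out.collect(new Text(md5Hex(rowKey)), line);
      }

      private static String md5Hex(String s) {
        try {
          byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes());
          StringBuilder hex = new StringBuilder();
          for (byte b : digest) {
            hex.append(String.format("%02x", b));
          }
          return hex.toString();
        } catch (NoSuchAlgorithmException e) {
          throw new RuntimeException(e);
        }
      }
    }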

I can see that our hardware will not be sufficient. For now this is a
test-lab setup, and it will have to be upgraded.

One more question, to understand the scenario better:
I have 120 reduce tasks running across all nodes, and only one node hosts
the initial region. So all 120 reduce tasks are trying to write to this one
machine? And what happens when the region is split? Do the reduce tasks
notice that writes should now go to a new region, or do they keep writing
to the first region, which then redirects the traffic?
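
For context, each reduce task writes with the plain client API, roughly like
this (simplified along the lines of the bundled SampleUploader; the table and
column names are just examples):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;

    public class WriteOneRow {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "test_table");

        // One BatchUpdate per row; the client decides which RegionServer
        // to contact for this row key.
        BatchUpdate update = new BatchUpdate("row-0001");
        update.put("family_a:col1", "some value".getBytes());
        table.commit(update);
      }
    }

My working assumption is that each commit goes to whichever RegionServer
currently hosts the region covering that row key, hence the question above.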

Thanks for your help
Matthias


On Wed, Mar 11, 2009 at 7:50 PM, Jonathan Gray <jlist@streamy.com> wrote:

> Mat,
>
> Do you have DataNodes hosted on the same machines with RegionServers?
>
> Is this import job running as a MapReduce?
>
> You have 4 maps and 4 reduces per node, plus the DN and the RS.  I'd
> recommend at the very least 4 cores, or 8 if you have CPU-intensive
> MR jobs.
>
> Before memory becomes an issue, you're quickly going to be CPU bound
> between all three of these things running on a single core (hyperthreaded
> or not, even 2 cores may not be sufficient).
>
> I have had some luck with splitting my tables early on in the import, but
> this will only make a difference if you have fully randomized the insert
> order of your keys, as Ryan pointed out.
>
> Either way, you should probably have max map and reduce tasks set to 1 each
> per node.  Or, another idea: since you have a decent number of nodes, you
> could segment your cluster a bit to prevent starvation and contention
> between 4+ JVMs on a core.  Run HDFS separately from HBase and MR.  I'd
> have to know more about what you're trying to do to help you figure out
> the best distribution.
>
> JG
>
> > -----Original Message-----
> > From: Mat Hofschen [mailto:hofschen@gmail.com]
> > Sent: Wednesday, March 11, 2009 1:15 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Import into empty table
> >
> > Hi all,
> > I am having trouble importing a medium-sized dataset into an empty new
> > table. The import runs for about 60 minutes.
> > There are a lot of failed/killed tasks in this scenario, and sometimes
> > the import fails altogether.
> >
> > If I import a smaller subset into the empty table, then perform a manual
> > split of the regions (via the split button on the web UI), and then
> > import the larger dataset, the import runs for about 10 minutes.
> >
> > It seems to me that the performance bottleneck during the first import
> > is the single region on a single cluster machine. This machine is
> > heavily loaded. So my question is whether I can force HBase to split
> > sooner during heavy write operations, and which tuning parameters
> > affect this scenario.
> >
> > Thanks for your help,
> > Matthias
> >
> > p.s. here are the details:
> >
> > 33 cluster machines in the test lab (3-year-old servers with
> > hyperthreaded single-core CPUs), 1.5 GB of memory, Debian 5 (Lenny) 32-bit
> > hadoop 0.19.0, hbase 0.19.0
> > -Xmx500m for the Java processes
> > hadoop
> > mapred.map.tasks=20
> > mapred.reduce.tasks=15
> > dfs.block.size=16777216
> > mapred.tasktracker.map.tasks.maximum=4
> > mapred.tasktracker.reduce.tasks.maximum=4
> >
> > hbase
> > hbase.hregion.max.filesize=67108864
> >
> > hbase table
> > 3 column families
> >
> > import file
> > 5 million records with 18 columns (6 columns per family)
> > file size: 1.1 GB CSV file
> > import via the provided Java SampleUploader
>
>
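
P.S. Regarding the suggestion to cap the map and reduce tasks at one each
per node: if I read the 0.19 configuration right, that corresponds to these
two settings in hadoop-site.xml on every TaskTracker (with a restart), in
place of our current value of 4:

    mapred.tasktracker.map.tasks.maximum=1
    mapred.tasktracker.reduce.tasks.maximum=1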
