hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Doubt in HBase
Date Fri, 21 Aug 2009 03:33:15 GMT
The behavior of TableInputFormat is to schedule one mapper for every table region. 
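
As a rough illustration of what that means for a table spread over 5 region servers, here is a toy model in plain Java. It is not HBase's actual split code: the Region class, the planSplits method, and the region layout are all made up for the example. The point is only that the number of map tasks follows the number of regions, not the number of region servers hosting them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: Region and planSplits are hypothetical names, not HBase classes.
class RegionSplitSketch {

    /** Stand-in for one table region: its row-key range and the server hosting it. */
    static final class Region {
        final String startKey;
        final String endKey;
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    /** Emits one split description per region; each split would be handled by one map task. */
    static List<String> planSplits(List<Region> regions) {
        List<String> splits = new ArrayList<>();
        for (Region r : regions) {
            splits.add("[" + r.startKey + ", " + r.endKey + ") on " + r.server);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Seven regions spread over five servers (made-up layout).
        List<Region> regions = Arrays.asList(
            new Region("", "d", "rs1"),
            new Region("d", "h", "rs1"),
            new Region("h", "m", "rs2"),
            new Region("m", "q", "rs3"),
            new Region("q", "t", "rs3"),
            new Region("t", "w", "rs4"),
            new Region("w", "", "rs5"));

        List<String> splits = planSplits(regions);
        System.out.println("map tasks: " + splits.size());
        for (String s : splits) {
            System.out.println("  " + s);
        }
    }
}
```

In this layout the job would get seven map tasks even though only five servers are involved.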

In addition to what others have said already, if your reducer is doing little more than storing
data back into HBase (via TableOutputFormat), consider writing results back to HBase directly
from the mapper. That avoids the overhead of the sort/shuffle/merge that the Hadoop job
framework performs as map outputs are fed into reducers. For that type of use case (using the
Hadoop mapreduce subsystem as essentially a grid scheduler), something like
job.setNumReduceTasks(0) will do the trick.
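
A minimal sketch of such a map-only job follows. It assumes the org.apache.hadoop.hbase.mapreduce API of a later HBase client (1.x or newer, with the HBase and Hadoop jars on the classpath) rather than the 2009-era release. The table names "source_table" and "target_table", the column "cf:q", and the class names are hypothetical placeholders. The Hadoop call that disables the reduce phase is Job.setNumReduceTasks(0).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyHBaseWrite {

    // Hypothetical column coordinates, for illustration only.
    private static final byte[] FAMILY = Bytes.toBytes("cf");
    private static final byte[] QUALIFIER = Bytes.toBytes("q");

    /** Reads each row of the source table and emits a Put for the target table. */
    public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] cell = value.getValue(FAMILY, QUALIFIER);
            if (cell == null) {
                return; // nothing to write for this row
            }
            Put put = new Put(row.get());
            put.addColumn(FAMILY, QUALIFIER, cell);
            // With zero reducers, this goes straight to the job's output format.
            context.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "map-only-hbase-write");
        job.setJarByClass(MapOnlyHBaseWrite.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // fetch rows in larger batches for a full scan
        scan.setCacheBlocks(false); // avoid filling the block cache from a one-off scan

        // Input: one map task per region of the source table.
        TableMapReduceUtil.initTableMapperJob("source_table", scan, CopyMapper.class,
                ImmutableBytesWritable.class, Put.class, job);

        // Output: TableOutputFormat on the target table; no reducer class is set.
        TableMapReduceUtil.initTableReducerJob("target_table", null, job);

        // No reduce phase, so no sort/shuffle/merge of map outputs.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This only fits when each map output can be written independently. If values for the same key have to be combined across regions, a reducer is still needed.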

Best regards,

   - Andy

From: john smith <js1987.smith@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all,

I have one small doubt. Kindly answer it even if it sounds silly.

I am using MapReduce with HBase in distributed mode. I have a table which
spans 5 region servers. I am using TableInputFormat to read the data
from the table in the map. When I run the program, how many map tasks are
created by default? Is it one per region server or more?

Also, after the map task is over, the reduce task takes a bit more time. Is
it due to moving the map output across the region servers, i.e., moving the
values of the same key to a particular reducer before it can start? Is
there any way I can optimize the code (e.g., by storing the data for the same
reducer nearby)?

Thanks :)