hbase-user mailing list archives

From tigertail <tyc...@yahoo.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 21:52:00 GMT

Hi St. Ack,

***************************************************************************
1.
First, thank you for your last reply, which prompted me to re-check my code,
where I did find a stupid problem.

In the map function of my old code I called

		HTable table = new HTable(conf, this.tableName);
		RowResult rowResult = table.getRow(key);

which basically means that for each row I created a new "connection" to
the table. This is awkward!

In my new code I create only one such "connection", during the job
configuration phase:

	// Called once per task, so the table handle is created once per mapper.
	public void configure(JobConf job) {
		String tableName = job.get(TABLENAME);
		try {
			setTable(job, tableName);
		} catch (Exception e) {
			// If this fails, this.table stays null and map() will fail later.
			LOG.error(e);
		}
	}

	private HTable table;

	protected void setTable(final JobConf job, final String tableName)
			throws Exception {
		this.table = new HTable(new HBaseConfiguration(job), tableName);
	}

and then I just call

		RowResult rowResult = this.table.getRow(msgid);
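
The difference between the two versions can be sketched without HBase at all.
The snippet below is only an illustration: FakeTable is a made-up stand-in
that counts how often a handle is opened, it is not the HTable class, and the
table name and keys are placeholders.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class HandleReuseSketch {

    // Hypothetical stand-in for an expensive table handle (NOT HBase's HTable).
    static final class FakeTable {
        static final AtomicInteger OPENED = new AtomicInteger();

        FakeTable(String tableName) {
            OPENED.incrementAndGet(); // count every handle that gets created
        }

        String getRow(String key) {
            return "row:" + key;
        }
    }

    // Old approach: a new handle is opened inside every map() call.
    static int opensWithHandlePerRow(int rows) {
        FakeTable.OPENED.set(0);
        for (int i = 0; i < rows; i++) {
            new FakeTable("mytable").getRow("key" + i);
        }
        return FakeTable.OPENED.get();
    }

    // New approach: open once (as in configure()) and reuse it for each row.
    static int opensWithSharedHandle(int rows) {
        FakeTable.OPENED.set(0);
        FakeTable table = new FakeTable("mytable");
        for (int i = 0; i < rows; i++) {
            table.getRow("key" + i);
        }
        return FakeTable.OPENED.get();
    }

    public static void main(String[] args) {
        System.out.println("handles opened, one per row: " + opensWithHandlePerRow(1000));
        System.out.println("handles opened, shared:      " + opensWithSharedHandle(1000));
    }
}
```

For 1000 rows the first method opens 1000 handles and the second opens one,
which is the whole point of moving the construction into configure().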

With this revision the job now runs stably and takes 110 minutes to
read 10M records.
So for Q1, I can read 1M records in about 11 minutes, which looks OK.
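
The 11-minute figure is just the measured run scaled down, which assumes the
read time grows linearly with the number of records (an assumption, not
something measured separately). A small sketch of that arithmetic, with
made-up method names:

```java
final class ThroughputSketch {

    // Scales a measured run time to another record count, assuming linear scaling.
    static double minutesFor(long records, long measuredRecords, double measuredMinutes) {
        return measuredMinutes * records / measuredRecords;
    }

    // Average throughput over the whole job, in records per second.
    static double recordsPerSecond(long records, double minutes) {
        return records / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long measuredRecords = 10_000_000L; // 10M records read by the job
        double measuredMinutes = 110.0;     // total job time reported above

        System.out.println("minutes for 1M records: "
                + minutesFor(1_000_000L, measuredRecords, measuredMinutes));
        System.out.println("records per second (rounded): "
                + Math.round(recordsPerSecond(measuredRecords, measuredMinutes)));
    }
}
```

That works out to 11 minutes per 1M records, or roughly 1,500 records per
second across the whole cluster.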

***************************************************************************
2.

I use the default FileInputFormat, so yes, the file is split into 26 pieces
(not 32; I don't know why), and each mapper processed about 0.31 million
records (~1/32 of the 10M records).

Yes, all eight boxes are running a regionserver.  My table of 10M records
has 48 regions.

>> When your MR that did A2. below ran, was the 'getting' distributed across
>> the regions of the table or were you banging on single region of the
>> table the whole time? 
Where can I check that? I think the reads should go across all regions,
though, because I need to read all 10M records out.
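
One rough way to check this offline, assuming the region start keys of the
table are known, is to bucket the looked-up row keys by region. A region
covers the keys from its start key (inclusive) up to the next region's start
key, and the first region has an empty start key. The sketch below does not
call HBase; the class name, start keys and row keys are all made up, and it
compares keys as plain Java strings, which matches byte order only for ASCII
keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RegionSpreadSketch {

    // Counts how many row keys fall into each region. A region is identified
    // by its start key and covers [startKey, nextStartKey).
    static Map<String, Integer> countPerRegion(List<String> regionStartKeys,
                                               List<String> rowKeys) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String start : regionStartKeys) {
            counts.put(start, 0);
        }
        for (String row : rowKeys) {
            // The owning region is the one with the greatest start key <= row.
            String region = counts.floorKey(row);
            if (region == null) {
                throw new IllegalArgumentException("no region covers key: " + row);
            }
            counts.merge(region, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical start keys; "" is the start key of the first region.
        List<String> startKeys = Arrays.asList("", "g", "n", "t");
        // Hypothetical row keys taken from a mapper's input.
        List<String> rows = Arrays.asList("apple", "grape", "kiwi", "zebra", "mango");

        for (Map.Entry<String, Integer> e : countPerRegion(startKeys, rows).entrySet()) {
            System.out.println("region starting at '" + e.getKey() + "': " + e.getValue());
        }
    }
}
```

If nearly all the counts for one mapper's input land in a single bucket, that
mapper is hitting one region at a time rather than spreading its reads.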

I use Hadoop 0.18.2 and HBase 0.18.1. 
Thanks for the answer to Q3 too. That is what I will try soon: build a
Lucene index and see whether searching the index can speed up
column-based reading.

-- 
View this message in context: http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21081633.html
Sent from the HBase User mailing list archive at Nabble.com.

