hbase-user mailing list archives

From tigertail <tyc...@yahoo.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 04:37:32 GMT

Hi St. Ack,

Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4 CPUs).
Suppose the 1M row keys are known beforehand and saved in a file; I just
read each key into a mapper and use table.getRow(key) to get the record. I
also tried to increase the number of map tasks, but it did not improve the
performance. Actually, it made things worse: many tasks failed or were
killed with something like "no response in 600 seconds."
 

stack-3 wrote:
>  
> For A2. below, how many map tasks?  How did you split the 1M you wanted 
> to fetch? How many of them ran concurrently?
> St.Ack
> 
> 
> tigertail wrote:
>> Hi, can anybody help? Hopefully the following makes my question clearer
>> than my last post did.
>>
>> A1. I created a table in HBase and inserted 10 million records into it.
>> A2. I ran a M/R program that issued 10 million "get by rowkey" operations
>> to read the 10M records out, and it took about 3 hours to finish.
>> A3. I also ran a M/R program which used TableMap to read the 10M records
>> out, and it took just 12 minutes.
>>
>> Now suppose I only need to read 1 million records whose row keys are known
>> beforehand (and let's assume the worst case, where the 1M records are
>> evenly distributed among the 10M records).
>>
>> S1. I can use 1M "get by rowkey" operations. But it is slow.
>> S2. I can also simply use TableMap and output only the 1M matching records
>> from the map function, but it still reads the whole table (see the sketch
>> below).
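>>
>> A minimal sketch of S2 (the class name is hypothetical; this assumes the
>> 0.18 mapred API, where a TableMap subclass is fed ImmutableBytesWritable /
>> RowResult pairs by the table scanner):
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.hbase.io.Cell;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.io.RowResult;
>> import org.apache.hadoop.hbase.mapred.TableMap;
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> public class FilteringTableMap
>>     extends TableMap<ImmutableBytesWritable, RowResult> {
>>   public void map(ImmutableBytesWritable key, RowResult value,
>>       OutputCollector<ImmutableBytesWritable, RowResult> output,
>>       Reporter reporter) throws IOException {
>>     // Every row is still scanned; only matching rows are emitted,
>>     // so the full-table I/O cost remains.
>>     Cell cell = value.get(Bytes.toBytes("f1:testcol"));
>>     if (cell != null && "0".equals(Bytes.toString(cell.getValue()))) {
>>       output.collect(key, value);
>>     }
>>   }
>> }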
>>
>> Q1. Is there some more efficient way to read the 1M records, WITHOUT
>> PASSING THROUGH THE WHOLE TABLE?
>>
>> And what if I have 1 billion records in an HBase table and only need to
>> read 1 million of them, in the following two scenarios:
>>
>> Q2. Suppose their row keys are known beforehand.
>> Q3. Or suppose these 1 million records share the same value in one column.
>>
>> Any input would be greatly appreciated. Thank you so much!
>>
>>
>> tigertail wrote:
>>   
>>> For example, I have an HBase table with 1 billion records. Each record
>>> has a column named 'f1:testcol', and I want to get only the records with
>>> 'f1:testcol'=0 as the input to my map function. Supposing there are 1
>>> million such records, I would expect this to be much faster than feeding
>>> all 1 billion records into my map function and doing the condition check
>>> there.
>>>
>>> From searching this board and the HBase documentation, I tried
>>> implementing my own subclass of TableInputFormat that sets a
>>> ColumnValueFilter in its configure method:
>>>
>>> import org.apache.commons.logging.Log;
>>> import org.apache.commons.logging.LogFactory;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.client.HTable;
>>> import org.apache.hadoop.hbase.filter.ColumnValueFilter;
>>> import org.apache.hadoop.hbase.mapred.TableInputFormat;
>>> import org.apache.hadoop.hbase.util.Bytes;
>>> import org.apache.hadoop.mapred.FileInputFormat;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.JobConfigurable;
>>>
>>> public class TableInputFilterFormat extends TableInputFormat implements
>>>     JobConfigurable {
>>>   private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);
>>>
>>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
>>>
>>>   public void configure(JobConf job) {
>>>     // The table name arrives as the job's input path.
>>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>>>
>>>     // Space-separated column list, e.g. "f1:testcol".
>>>     String colArg = job.get(COLUMN_LIST);
>>>     String[] colNames = colArg.split(" ");
>>>     byte[][] m_cols = new byte[colNames.length][];
>>>     for (int i = 0; i < m_cols.length; i++) {
>>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>>>     }
>>>     setInputColums(m_cols); // sic: the 0.18 API spells it this way
>>>
>>>     // Feed the map only rows whose f1:testcol equals "0".
>>>     ColumnValueFilter filter = new ColumnValueFilter(
>>>         Bytes.toBytes("f1:testcol"),
>>>         ColumnValueFilter.CompareOp.EQUAL,
>>>         Bytes.toBytes("0"));
>>>     setRowFilter(filter);
>>>
>>>     try {
>>>       setHTable(new HTable(new HBaseConfiguration(job),
>>>           tableNames[0].getName()));
>>>     } catch (Exception e) {
>>>       LOG.error(e);
>>>     }
>>>   }
>>> }
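>>>
>>> A sketch of how this input format might be wired into the job (the driver
>>> class name "MyDriver" and the table name "mytable" are placeholders):
>>>
>>>   JobConf job = new JobConf(new HBaseConfiguration(), MyDriver.class);
>>>   job.setInputFormat(TableInputFilterFormat.class);
>>>   // The table name is passed as the job's input path (see configure()).
>>>   FileInputFormat.setInputPaths(job, new Path("mytable"));
>>>   // Space-separated columns, read back via COLUMN_LIST in configure().
>>>   job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");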
>>>
>>> However, the M/R job with the RowFilter is much slower than the M/R job
>>> without it. During the run many tasks fail with something like "Task
>>> attempt_200812091733_0063_m_000019_1 failed to report status for 604
>>> seconds. Killing!". I am wondering whether a RowFilter can really cut the
>>> records fed to the map from 1 billion down to 1 million. If it cannot, is
>>> there any other method to address this issue?
>>>
>>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>>>
>>> Thank you so much in advance!
>>>
>>>
>>>     
>>
>>   
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21066895.html
Sent from the HBase User mailing list archive at Nabble.com.

