hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Wed, 17 Dec 2008 23:25:46 GMT
For A2 below, how many map tasks? How did you split the 1M you wanted
to fetch? How many of them ran concurrently?
St.Ack


tigertail wrote:
> Hi, can anybody help? Hopefully the following makes my question clearer
> than my last post did.
>
> A1. I created a table in HBase and then inserted 10 million records into
> the table.
> A2. I ran an M/R program that issued a total of 10 million "get by rowkey"
> operations to read the 10M records out, and it took about 3 hours to finish.
> A3. I also ran an M/R program that used TableMap to read the 10M records
> out, and it took just 12 minutes.
>
> Now suppose I only need to read 1 million records whose row keys are known
> beforehand (and let's assume the worst case, where the 1M records are evenly
> distributed among the 10M records).
>
> S1. I can use 1M "get by rowkey" operations (see the sketch just below).
> But that is slow.
> S2. I can also simply use TableMap and output only the 1M records in the
> map function, but that actually reads the whole table.
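>
> As a rough sketch of S1 (assuming the HBase 0.18 client API; the table
> name, key list, and method name here are just for illustration), each
> map task would do its gets like this:
>
>   import java.util.List;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.io.RowResult;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   // S1: one RPC per known row key -- correct, but 1M round trips is slow.
>   void fetchKnownRows(List<String> knownKeys) throws Exception {
>     HTable table = new HTable(new HBaseConfiguration(), "mytable");
>     for (String key : knownKeys) {
>       RowResult row = table.getRow(Bytes.toBytes(key));
>       // ... hand the row to the map logic ...
>     }
>   }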
>
> Q1. Is there some more efficient way to read the 1M records, WITHOUT PASSING
> THROUGH THE WHOLE TABLE?
>
> And what if I have 1 billion records in an HBase table and only need to
> read 1 million of them, in the following two scenarios:
>
> Q2. Suppose their row keys are known beforehand.
> Q3. Or suppose these 1 million records share the same value in a column.
>
> Any input would be greatly appreciated. Thank you so much!
>
>
> tigertail wrote:
>   
>> For example, I have an HBase table with 1 billion records. Each record has
>> a column named 'f1:testcol', and I want to feed only the records with
>> 'f1:testcol'=0 to my map function. Supposing there are 1 million such
>> records, I would expect this to be much faster than feeding all 1 billion
>> records into my map function and then doing the condition check there.
>>
>> By searching this board and the HBase documents, I tried implementing my
>> own subclass of TableInputFormat that sets a ColumnValueFilter in its
>> configure method.
>>
>> import org.apache.commons.logging.Log;
>> import org.apache.commons.logging.LogFactory;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.filter.ColumnValueFilter;
>> import org.apache.hadoop.hbase.mapred.TableInputFormat;
>> import org.apache.hadoop.hbase.util.Bytes;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.JobConfigurable;
>>
>> public class TableInputFilterFormat extends TableInputFormat
>>     implements JobConfigurable {
>>   private static final Log LOG =
>>       LogFactory.getLog(TableInputFilterFormat.class);
>>
>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
>>
>>   public void configure(JobConf job) {
>>     // The table name is passed in as the job's input path.
>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>>
>>     // Parse the space-separated column list from the job configuration.
>>     String colArg = job.get(COLUMN_LIST);
>>     String[] colNames = colArg.split(" ");
>>     byte[][] m_cols = new byte[colNames.length][];
>>     for (int i = 0; i < m_cols.length; i++) {
>>       m_cols[i] = Bytes.toBytes(colNames[i]);
>>     }
>>     setInputColums(m_cols);  // sic: the method name as spelled in the API
>>
>>     // Only emit rows whose f1:testcol value equals "0".
>>     ColumnValueFilter filter = new ColumnValueFilter(
>>         Bytes.toBytes("f1:testcol"),
>>         ColumnValueFilter.CompareOp.EQUAL,
>>         Bytes.toBytes("0"));
>>     setRowFilter(filter);
>>
>>     try {
>>       setHTable(new HTable(new HBaseConfiguration(job),
>>           tableNames[0].getName()));
>>     } catch (Exception e) {
>>       LOG.error(e);
>>     }
>>   }
>> }
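>>
>> For what it's worth, here is roughly how I hook it into the job in my
>> driver (the driver class, table name, and mapper below are placeholders,
>> and MyMap extends the old-API TableMap):
>>
>>   JobConf job = new JobConf(MyDriver.class);
>>   job.setInputFormat(TableInputFilterFormat.class);
>>   // TableInputFormat takes the table name from the input path and the
>>   // scanned columns from COLUMN_LIST ("hbase.mapred.tablecolumns").
>>   FileInputFormat.setInputPaths(job, new Path("mytable"));
>>   job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");
>>   job.setMapperClass(MyMap.class);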
>>
>> However, the M/R job with the RowFilter is much slower than the M/R job
>> without it. During the run, many tasks fail with something like "Task
>> attempt_200812091733_0063_m_000019_1 failed to report status for 604
>> seconds. Killing!". I am wondering whether a RowFilter can really cut the
>> records fed to the mappers from 1 billion down to 1 million. If it cannot,
>> is there any other method to address this issue?
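>>
>> In the meantime, I can at least keep the long-running scans from being
>> killed by raising the task timeout in the driver (using the Hadoop
>> 0.18-era property name; the 20-minute value is just an example):
>>
>>   // Allow up to 20 minutes between progress reports before a task
>>   // attempt is declared dead (the default is 600 seconds).
>>   job.setLong("mapred.task.timeout", 20 * 60 * 1000L);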
>>
>> I am using Hadoop 0.18.2 and HBase 0.18.1.
>>
>> Thank you so much in advance!
>>
>>
>>     
>
>   

