hbase-user mailing list archives

From tigertail <tyc...@yahoo.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Wed, 17 Dec 2008 22:54:47 GMT

Hi, can anybody help? Hopefully the following makes my question clearer
than it was in my last post.

A1. I created a table in HBase and then I inserted 10 million records into
the table. 
A2. I ran an M/R program that issued 10 million "get by rowkey" operations
in total to read the 10M records out, and it took about 3 hours to finish.
A3. I also ran an M/R program that used TableMap to read the 10M records
out, and it took just 12 minutes.

Now suppose I only need to read 1 million records whose row keys are known
beforehand (and let's suppose, as the worst case, that the 1M records are
evenly distributed among the 10M records).

S1. I can use 1M "get by rowkey" operations. But that is slow.
S2. I can also simply use TableMap and output only the 1M records in the
map function, but that actually reads the whole table.
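
For a rough sense of how S1 and S2 compare, the timings from A2 and A3 can be
extrapolated. The sketch below is only back-of-the-envelope arithmetic. It
assumes the per-row cost of a get and of a scan stays constant as the row
count changes, which may not hold on a real cluster. The class and method
names are made up for illustration.

```java
public class ReadCostEstimate {
    // Measured in A2: 10M "get by rowkey" operations took about 3 hours.
    static final double GET_TOTAL_SECONDS = 3 * 3600.0;
    // Measured in A3: a TableMap scan of the same 10M records took 12 minutes.
    static final double SCAN_TOTAL_SECONDS = 12 * 60.0;
    // Number of records behind both measurements.
    static final double MEASURED_ROWS = 10_000_000.0;

    // Estimated time to fetch rowsFetched records one get at a time
    // (ASSUMPTION: cost per get is constant).
    static double getSeconds(long rowsFetched) {
        return rowsFetched * GET_TOTAL_SECONDS / MEASURED_ROWS;
    }

    // Estimated time to scan a table of tableRows records
    // (ASSUMPTION: scan time grows linearly with table size).
    static double scanSeconds(long tableRows) {
        return tableRows * SCAN_TOTAL_SECONDS / MEASURED_ROWS;
    }

    // Fraction of the table below which individual gets are estimated
    // to be cheaper than one full scan.
    static double breakEvenFraction() {
        return SCAN_TOTAL_SECONDS / GET_TOTAL_SECONDS;
    }

    public static void main(String[] args) {
        System.out.printf("S1, 1M gets:          %.1f min%n",
                getSeconds(1_000_000L) / 60.0);
        System.out.printf("S2, scan of 10M rows: %.1f min%n",
                scanSeconds(10_000_000L) / 60.0);
        System.out.printf("Scan of 1B rows:      %.1f h%n",
                scanSeconds(1_000_000_000L) / 3600.0);
        System.out.printf("Break-even fraction:  %.4f%n",
                breakEvenFraction());
    }
}
```

Under these assumptions, 1M gets come to about 18 minutes against 12 minutes
for scanning the whole 10M-row table, so S2 is still ahead there. For a
1-billion-row table the scan estimate grows to about 20 hours while 1M gets
stay at about 18 minutes, with the break-even at roughly 1 row in 15.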

Q1. Is there some more efficient way to read the 1M records, WITHOUT PASSING
THROUGH THE WHOLE TABLE?

How about if I have 1 billion records in an HBase table and I only need to
read 1 million records in the following two scenarios.

Q2. Suppose their row keys are known beforehand.
Q3. Or suppose these 1 million records have the same value in one column.

Any input would be greatly appreciated. Thank you so much!

tigertail wrote:
> For example, I have an HBase table with 1 billion records. Each record has
> a column named 'f1:testcol'. And I want to get only the records with
> 'f1:testcol'=0 as the input to my map function. Suppose there are 1
> million such records; I would expect this to be much faster than getting
> all 1 billion records into my map function and then doing a condition check.
> By searching this board and the HBase documentation, I tried to implement
> my own subclass of TableInputFormat and set a ColumnValueFilter in the
> configure method.
> public class TableInputFilterFormat extends TableInputFormat implements
>     JobConfigurable {
>   private final Log LOG = LogFactory.getLog(TableInputFilterFormat.class);
>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
>   public void configure(JobConf job) {
>     Path[] tableNames = FileInputFormat.getInputPaths(job);
>     String colArg = job.get(COLUMN_LIST);
>     String[] colNames = colArg.split(" ");
>     byte [][] m_cols = new byte[colNames.length][];
>     for (int i = 0; i < m_cols.length; i++) {
>       m_cols[i] = Bytes.toBytes(colNames[i]);
>     }
>     setInputColums(m_cols);
>     ColumnValueFilter filter = new ColumnValueFilter(
>         Bytes.toBytes("f1:testcol"),
>         ColumnValueFilter.CompareOp.EQUAL,
>         Bytes.toBytes("0"));
>     setRowFilter(filter);
>     try {
>       setHTable(new HTable(new HBaseConfiguration(job),
>           tableNames[0].getName()));
>     } catch (Exception e) {
>       LOG.error(e);
>     }
>   }
> }
> However, the M/R job with the RowFilter is much slower than the M/R job
> without it. During the run many tasks failed with something like "Task
> attempt_200812091733_0063_m_000019_1 failed to report status for 604
> seconds. Killing!". I am wondering whether a RowFilter can really reduce
> the records fed to the map function from 1 billion to 1 million. If it
> cannot, is there any other method to address this issue?
> I am using Hadoop 0.18.2 and HBase 0.18.1.
> Thank you so much in advance!
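
To make the RowFilter question above concrete, here is a toy in-memory model
of a filtered scan. It is not HBase code and uses no HBase classes. It
assumes the filter is applied row by row while the scan walks the table,
which I have not confirmed for HBase 0.18.1. All names (FilteredScanModel,
buildTable, scan) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class FilteredScanModel {
    /** Keys that passed the filter, plus how many rows the scan examined. */
    static final class ScanResult {
        final List<String> matchedKeys = new ArrayList<>();
        long rowsExamined = 0;
    }

    // Walks every row in key order and keeps only rows whose value passes
    // the filter. ASSUMPTION: this mirrors a filter applied during a scan.
    static ScanResult scan(TreeMap<String, String> table,
                           Predicate<String> valueFilter) {
        ScanResult result = new ScanResult();
        for (Map.Entry<String, String> row : table.entrySet()) {
            result.rowsExamined++;
            if (valueFilter.test(row.getValue())) {
                result.matchedKeys.add(row.getKey());
            }
        }
        return result;
    }

    // Builds a sorted table in which every matchEvery-th row holds "0"
    // and all other rows hold "1".
    static TreeMap<String, String> buildTable(int rows, int matchEvery) {
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            String value = (i % matchEvery == 0) ? "0" : "1";
            table.put(String.format("row%07d", i), value);
        }
        return table;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = buildTable(10_000, 1_000);
        ScanResult result = scan(table, value -> value.equals("0"));
        System.out.println("rows examined: " + result.rowsExamined);
        System.out.println("rows returned: " + result.matchedKeys.size());
    }
}
```

In this model the filter cuts the number of rows returned but not the number
of rows examined, and long stretches of the scan return nothing at all. If
HBase behaves the same way, that would be consistent with both the slowdown
and the "failed to report status" task failures described above.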

View this message in context: http://www.nabble.com/How-to-read-a-subset-of-records-based-on-a-column-value-in-a-M-R-job--tp20963771p21063403.html
Sent from the HBase User mailing list archive at Nabble.com.
