hbase-user mailing list archives

From "Erik Holstad" <erikhols...@gmail.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 19:02:30 GMT
Hi Tigertail!
I have written some MR jobs before, but nothing fancy like implementing my
own filter the way you have. What I do know is that you can specify the
columns you want to read as the input to the map task. But since I'm not
sure how that filter process is handled internally, I can't say whether it
reads in all the columns and then filters them out, or how it actually
works. Please let me know how it works, those of you out there who have
this knowledge :).

But you could try having a column family age: with one column for every
age you want to be able to query, for example age:30. That way you don't
have to look at the value of the column; you use the column qualifier
itself as the key.
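
If it helps, here is a minimal in-memory sketch of that column-qualifier-as-key idea. Plain Java collections stand in for an HBase table, and all names (`AgeColumnSketch`, `namesWithAge`, the sample people) are made up for illustration:

```java
import java.util.*;

public class AgeColumnSketch {
    // rows: name -> set of column qualifiers present for that row
    static Map<String, Set<String>> rows = new HashMap<>();

    static void put(String name, int age) {
        // encode the age in the column qualifier, e.g. "age:30",
        // instead of storing it as a cell value
        rows.computeIfAbsent(name, k -> new HashSet<>()).add("age:" + age);
    }

    // selecting by age is now a column-existence check, not a value compare
    static List<String> namesWithAge(int age) {
        String qualifier = "age:" + age;
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : rows.entrySet()) {
            if (e.getValue().contains(qualifier)) {
                hits.add(e.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        put("alice", 30);
        put("bob", 25);
        put("carol", 30);
        System.out.println(namesWithAge(30)); // prints [alice, carol]
    }
}
```

In a real table the existence check would be the column selection you pass to the scanner, so no value comparison is needed at read time.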

Hope that helps a little, and please let me know what results you come up
with.

Regards Erik

On Thu, Dec 18, 2008 at 9:26 AM, tigertail <tyczjs@yahoo.com> wrote:

>
> Thanks Erik,
>
> What I want is to quickly return a small subset, either by row key values
> or by a specific value in a column, without reading all records into the
> mapper. So I actually have two questions :)
>
> For the column-based search: for example, I have 1 billion people records
> in the table, the row key is the "name", and there is an "age" column. Now
> I want to find the records with age=30. How can I avoid reading every
> record into the mapper and then filtering the output?
>
> For searching by row key values, suppose I have 1 million people's names.
> Is there a more efficient way than running table.getRow(name) 1 million
> times, given that the "name" strings are randomly distributed (and hence
> it is useless to write a new getSplits)?
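
For the known-keys case, one general trick (a hedged sketch, not something proposed in this thread) is to sort the known keys first and partition them into contiguous ranges, so each map task can scan one narrow key range instead of issuing gets in random order. The class and method names below are made up:

```java
import java.util.*;

public class KeyRangeSplits {
    // Partition a sorted list of known row keys into n contiguous ranges,
    // so each range can be handed to one map task as a [start, end] scan.
    static List<String[]> splits(List<String> sortedKeys, int n) {
        List<String[]> ranges = new ArrayList<>();
        int per = (sortedKeys.size() + n - 1) / n;  // ceiling division
        for (int i = 0; i < sortedKeys.size(); i += per) {
            int end = Math.min(i + per, sortedKeys.size()) - 1;
            ranges.add(new String[] { sortedKeys.get(i), sortedKeys.get(end) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(List.of("carol", "alice", "erin", "bob", "dave"));
        Collections.sort(keys);  // random key order -> sorted, for range locality
        for (String[] r : splits(keys, 2)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Sorting means consecutive keys tend to land in the same region, so each range touches far fewer regions than the same keys fetched in random order.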
>
> >> Did you try putting that column in only for the rows that you want to
> >> get, and using that as the input to the MR?
>
> I am not sure I follow you there. I can use
> TableInputFormatBase.setInputColums in my program to return only the "age"
> column, but I still need to read every row from the table into the mapper.
> Or is my understanding wrong? Can you give more details on your idea?
>
> Thanks again.
>
>
>
> Erik Holstad wrote:
> >
> > Hi Tigertail!
> > Not sure if I understand your original problem correctly, but it seemed
> > to me that you just wanted to get the rows with the value 1 in a column,
> > right?
> >
> > Did you try putting that column in only for the rows that you want to
> > get, and using that as the input to the MR?
> >
> > I haven't timed my MR jobs with this approach, so I'm not sure how it is
> > handled internally, but it may be worth a try.
> >
> > Regards Erik
> >
> > On Wed, Dec 17, 2008 at 8:37 PM, tigertail <tyczjs@yahoo.com> wrote:
> >
> >>
> >> Hi St. Ack,
> >>
> >> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each with 4
> >> CPUs). Suppose the 1M row keys are known beforehand and saved in a
> >> file; I just read each key into a mapper and use table.getRow(key) to
> >> get the record. I also tried to increase the number of map tasks, but
> >> it did not improve the performance. Actually, it got worse: many tasks
> >> failed or were killed with something like "no response in 600 seconds."
> >>
> >>
> >> stack-3 wrote:
> >> >
> >> > For A2. below, how many map tasks?  How did you split the 1M you
> wanted
> >> > to fetch? How many of them ran concurrently?
> >> > St.Ack
> >> >
> >> >
> >> > tigertail wrote:
> >> >> Hi, can anybody help? Hopefully the following makes my question
> >> >> clearer, in case it wasn't in my last post.
> >> >>
> >> >> A1. I created a table in HBase and inserted 10 million records into
> >> >> it.
> >> >> A2. I ran a M/R program with a total of 10 million "get by rowkey"
> >> >> operations to read the 10M records out, and it took about 3 hours to
> >> >> finish.
> >> >> A3. I also ran a M/R program which used TableMap to read the 10M
> >> >> records out, and it took just 12 minutes.
> >> >>
> >> >> Now suppose I only need to read 1 million records whose row keys are
> >> >> known beforehand (and let's assume the worst case: the 1M records are
> >> >> evenly distributed among the 10M).
> >> >>
> >> >> S1. I can use 1M "get by rowkey" operations. But that is slow.
> >> >> S2. I can also simply use TableMap and output only the 1M records in
> >> >> the map function, but that actually reads the whole table.
> >> >>
> >> >> Q1. Is there a more efficient way to read the 1M records, WITHOUT
> >> >> PASSING THROUGH THE WHOLE TABLE?
> >> >>
> >> >> How about if I have 1 billion records in an HBase table and I only
> >> >> need to read 1 million records, in the following two scenarios:
> >> >>
> >> >> Q2. Suppose their row keys are known beforehand.
> >> >> Q3. Or suppose these 1 million records share the same value in a
> >> >> column.
> >> >>
> >> >> Any input would be greatly appreciated. Thank you so much!
> >> >>
> >> >>
> >> >> tigertail wrote:
> >> >>
> >> >>> For example, I have an HBase table with 1 billion records. Each
> >> >>> record has a column named 'f1:testcol', and I want to get only the
> >> >>> records with 'f1:testcol'=0 as the input to my map function. Suppose
> >> >>> there are 1 million such records; I would expect this to be much
> >> >>> faster than feeding all 1 billion records into my map function and
> >> >>> then doing the condition check.
> >> >>>
> >> >>> By searching this board and the HBase documentation, I tried
> >> >>> implementing my own subclass of TableInputFormat and setting a
> >> >>> ColumnValueFilter in the configure method.
> >> >>>
> >> >>> public class TableInputFilterFormat extends TableInputFormat
> >> >>>     implements JobConfigurable {
> >> >>>   private final Log LOG =
> >> >>>       LogFactory.getLog(TableInputFilterFormat.class);
> >> >>>
> >> >>>   public static final String FILTER_LIST = "hbase.mapred.tablefilters";
> >> >>>
> >> >>>   public void configure(JobConf job) {
> >> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
> >> >>>
> >> >>>     String colArg = job.get(COLUMN_LIST);
> >> >>>     String[] colNames = colArg.split(" ");
> >> >>>     byte[][] m_cols = new byte[colNames.length][];
> >> >>>     for (int i = 0; i < m_cols.length; i++) {
> >> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
> >> >>>     }
> >> >>>     setInputColums(m_cols);
> >> >>>
> >> >>>     ColumnValueFilter filter = new ColumnValueFilter(
> >> >>>         Bytes.toBytes("f1:testcol"),
> >> >>>         ColumnValueFilter.CompareOp.EQUAL,
> >> >>>         Bytes.toBytes("0"));
> >> >>>     setRowFilter(filter);
> >> >>>
> >> >>>     try {
> >> >>>       setHTable(new HTable(new HBaseConfiguration(job),
> >> >>>           tableNames[0].getName()));
> >> >>>     } catch (Exception e) {
> >> >>>       LOG.error(e);
> >> >>>     }
> >> >>>   }
> >> >>> }
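
As an aside, the effect of an EQUAL ColumnValueFilter like the one above can be pictured as a per-row predicate. The following is only an in-memory sketch with made-up names (`ValueFilterSketch`, `keepRow`), not HBase's actual implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class ValueFilterSketch {
    // A row is a map of column -> cell value (bytes), as in the thread's schema.
    // keepRow mirrors what an EQUAL column-value filter decides for one row:
    // keep it only if the named column exists and its value matches.
    static boolean keepRow(Map<String, byte[]> row, String column, byte[] wanted) {
        byte[] actual = row.get(column);
        return actual != null && Arrays.equals(actual, wanted);
    }

    public static void main(String[] args) {
        byte[] zero = "0".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> match = Map.of("f1:testcol", zero);
        Map<String, byte[]> noMatch = Map.of("f1:testcol",
                "7".getBytes(StandardCharsets.UTF_8));
        System.out.println(keepRow(match, "f1:testcol", zero));    // prints true
        System.out.println(keepRow(noMatch, "f1:testcol", zero));  // prints false
    }
}
```

Note that even when such a filter runs server-side, every row still has to be read and tested against the predicate; the filter saves shipping non-matching rows to the client, not the underlying scan work.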
> >> >>>
> >> >>> However, the M/R job with the RowFilter is much slower than the M/R
> >> >>> job without it. During the run many tasks failed with something like
> >> >>> "Task attempt_200812091733_0063_m_000019_1 failed to report status
> >> >>> for 604 seconds. Killing!". I am wondering whether RowFilter can
> >> >>> really reduce the records fed to the map function from 1 billion to
> >> >>> 1 million. If it cannot, is there any other way to address this
> >> >>> issue?
> >> >>>
> >> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
> >> >>>
> >> >>> Thank you so much in advance!
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >
> >
>
>
>
