hbase-user mailing list archives

From "Erik Holstad" <erikhols...@gmail.com>
Subject Re: How to read a subset of records based on a column value in a M/R job?
Date Thu, 18 Dec 2008 20:36:04 GMT
Hi Tigertail!
Not sure if I understand you correctly, but what I meant when I said to use
a schema like age:30 was to only put the column in there if it has a value,
so you don't have to check whether it has a value or not. The fact that the
row has that column at all is good enough. Not sure if that made any sense.
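Something like this rough, untested sketch against the 0.18 client API
(the table, row, and column names here are just examples):

  HTable table = new HTable(new HBaseConfiguration(), "people");

  // Write side: add the age:30 column only to rows where it applies.
  BatchUpdate update = new BatchUpdate("bob");
  update.put("age:30", Bytes.toBytes(""));  // presence is the flag
  table.commit(update);

  // Read side: scan just that column; rows without it never come back.
  Scanner scanner = table.getScanner(new byte[][] { Bytes.toBytes("age:30") });
  RowResult result;
  while ((result = scanner.next()) != null) {
    // every row seen here has the age:30 column
  }
  scanner.close();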

Regards Erik


On Thu, Dec 18, 2008 at 11:56 AM, tigertail <tyczjs@yahoo.com> wrote:

>
> FYI, I also tried to implement my subclass of TableInputFormat and in
> its configure method I called
>
>    byte [] colName = Bytes.toBytes("f1:age");
>    byte [] colValue = Bytes.toBytes("30");
>    ColumnValueFilter filter = new
> ColumnValueFilter(colName, ColumnValueFilter.CompareOp.EQUAL, colValue);
>    setRowFilter(filter);
>
> But as I said in my 1st post, it seems to be even slower than reading all
> rows.
>
>
> tigertail wrote:
> >
> > Erik,
> >
> > As far as I know, the column filtering happens in TableInputFormatBase.
> > We can use setInputColumns to assign the columns we want returned, and
> > then TableInputFormatBase will start a scanner over those columns.
> >
> > Yes, we can use "age" as the family and each age value as a column. But
> > can it avoid reading all rows, unlike the following code?
> >
> >   public void map(ImmutableBytesWritable row, RowResult value,
> >       OutputCollector<Text, Text> output,
> >       @SuppressWarnings("unused") Reporter reporter)
> >       throws IOException {
> >     // Client-side check: every row still reaches the mapper.
> >     Cell cell = value.get("age:30".getBytes());
> >     if (cell == null) {
> >       return;
> >     }
> >     ...
> >   }
> >
> > Erik Holstad wrote:
> >>
> >> Hi Tigertail!
> >> I have written some MR jobs earlier, but nothing fancy like
> >> implementing your own filter like you. What I do know is that you can
> >> specify the columns that you want to read as the input to the map
> >> task. But since I'm not sure how that filter process is handled
> >> internally, I can't say whether it reads in all the columns and then
> >> filters them out, or how it actually does it. Please let me know how
> >> it works, you people out there that have this knowledge :).
> >>
> >> But you could try to have a column family age: and then have one
> >> column for every age that you want to be able to specify, for example
> >> age:30 or something, so you don't have to look at the value of the
> >> column, but rather use the column itself as the key.
> >>
> >> Hope that helped you a little bit, and please let me know what kind
> >> of results you come up with.
> >>
> >> Regards Erik
> >>
> >> On Thu, Dec 18, 2008 at 9:26 AM, tigertail <tyczjs@yahoo.com> wrote:
> >>
> >>>
> >>> Thanks Erik,
> >>>
> >>> What I want is, either by row key values or by a specific value in a
> >>> column, to quickly return a small subset without reading all records
> >>> into the mapper. So I actually have two questions :)
> >>>
> >>> For the column-based search: for example, I have 1 billion people
> >>> records in the table, the row key is the "name", and there is an
> >>> "age" column. Now I want to find the records with age=30. How can I
> >>> avoid reading every record into the mapper and then filtering the
> >>> output?
> >>>
> >>> For searching by row key values: let's suppose I have 1 million
> >>> people's names. Is there a more efficient way than running 1 million
> >>> table.getRow(name) calls, in case the "name" strings are randomly
> >>> distributed (and hence it is useless to write a new getSplits)?
> >>>
> >>> >> Did you try to only put that column in there for the rows that you
> >>> >> want to get and use that as an input to the MR?
> >>>
> >>> I am not sure I get you there. I can use
> >>> TableInputFormatBase.setInputColums in my program to only return the
> >>> "age" column, but still, I need to read every row from the table into
> >>> the mapper. Or is my understanding wrong? Can you give more details
> >>> on your thought?
> >>>
> >>> Thanks again.
> >>>
> >>>
> >>>
> >>> Erik Holstad wrote:
> >>> >
> >>> > Hi Tigertail!
> >>> > Not sure if I understand your original problem correctly, but it
> >>> > seemed to me that you wanted to just get the rows with the value 1
> >>> > in a column, right?
> >>> >
> >>> > Did you try to only put that column in there for the rows that you
> >>> > want to get and use that as an input to the MR?
> >>> >
> >>> > I haven't timed my MR jobs with this approach, so I'm not sure how
> >>> > it is handled internally, but maybe it is worth giving it a try.
> >>> >
> >>> > Regards Erik
> >>> >
> >>> > On Wed, Dec 17, 2008 at 8:37 PM, tigertail <tyczjs@yahoo.com> wrote:
> >>> >
> >>> >>
> >>> >> Hi St. Ack,
> >>> >>
> >>> >> Thanks for your input. I ran 32 map tasks (I have 8 boxes, each
> >>> >> with 4 CPUs). Suppose the 1M row keys are known beforehand and
> >>> >> saved in a file; I just read each key into a mapper and use
> >>> >> table.getRow(key) to get the record. I also tried to increase the
> >>> >> # of map tasks, but it did not improve the performance. Actually,
> >>> >> it was even worse: many tasks failed or were killed with something
> >>> >> like "no response in 600 seconds."
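> >>> >>
> >>> >> For concreteness, the map body I mean is roughly this (untested
> >>> >> sketch; it assumes the HTable was opened in configure() and each
> >>> >> input line holds one row key):
> >>> >>
> >>> >>   public void map(LongWritable offset, Text line,
> >>> >>       OutputCollector<Text, Text> output, Reporter reporter)
> >>> >>       throws IOException {
> >>> >>     // One random read per key -- no scanner, no locality.
> >>> >>     RowResult r = table.getRow(Bytes.toBytes(line.toString()));
> >>> >>     if (r != null && !r.isEmpty()) {
> >>> >>       output.collect(line, new Text(r.toString()));
> >>> >>     }
> >>> >>   }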
> >>> >>
> >>> >>
> >>> >> stack-3 wrote:
> >>> >> >
> >>> >> > For A2. below, how many map tasks? How did you split the 1M you
> >>> >> > wanted to fetch? How many of them ran concurrently?
> >>> >> > St.Ack
> >>> >> >
> >>> >> >
> >>> >> > tigertail wrote:
> >>> >> >> Hi, can anybody help? Hopefully the following can be helpful
> >>> >> >> to make my question clear if it was not in my last post.
> >>> >> >>
> >>> >> >> A1. I created a table in HBase and then I inserted 10 million
> >>> >> >> records into the table.
> >>> >> >> A2. I ran an M/R program with 10 million "get by rowkey"
> >>> >> >> operations in total to read the 10M records out, and it took
> >>> >> >> about 3 hours to finish.
> >>> >> >> A3. I also ran an M/R program which used TableMap to read the
> >>> >> >> 10M records out, and it took just 12 minutes.
> >>> >> >>
> >>> >> >> Now suppose I only need to read 1 million records whose row
> >>> >> >> keys are known beforehand (and let's suppose, in the worst
> >>> >> >> case, the 1M records are evenly distributed among the 10M
> >>> >> >> records).
> >>> >> >>
> >>> >> >> S1. I can use 1M "get by rowkey" operations. But it is slow.
> >>> >> >> S2. I can also simply use TableMap and only output the 1M
> >>> >> >> records in the map function, but it actually reads the whole
> >>> >> >> table.
> >>> >> >>
> >>> >> >> Q1. Is there some more efficient way to read the 1M records,
> >>> >> >> WITHOUT PASSING THROUGH THE WHOLE TABLE?
> >>> >> >>
> >>> >> >> How about if I have 1 billion records in an HBase table and I
> >>> >> >> only need to read 1 million records, in the following two
> >>> >> >> scenarios:
> >>> >> >>
> >>> >> >> Q2. suppose their row keys are known beforehand
> >>> >> >> Q3. or suppose these 1 million records have the same value in
> >>> >> >> a column
> >>> >> >>
> >>> >> >> Any input would be greatly appreciated. Thank you so much!
> >>> >> >>
> >>> >> >>
> >>> >> >> tigertail wrote:
> >>> >> >>
> >>> >> >>> For example, I have an HBase table with 1 billion records.
> >>> >> >>> Each record has a column named 'f1:testcol', and I want to get
> >>> >> >>> only the records with 'f1:testcol'=0 as the input to my map
> >>> >> >>> function. Suppose there are 1 million such records; I would
> >>> >> >>> expect this to be much faster than getting all 1 billion
> >>> >> >>> records into my map function and then doing the condition
> >>> >> >>> check.
> >>> >> >>>
> >>> >> >>> By searching on this board and the HBase documents, I tried to
> >>> >> >>> implement my own subclass of TableInputFormat and set a
> >>> >> >>> ColumnValueFilter in its configure method.
> >>> >> >>>
> >>> >> >>> public class TableInputFilterFormat extends TableInputFormat
> >>> >> >>>     implements JobConfigurable {
> >>> >> >>>   private final Log LOG =
> >>> >> >>>       LogFactory.getLog(TableInputFilterFormat.class);
> >>> >> >>>
> >>> >> >>>   public static final String FILTER_LIST =
> >>> >> >>>       "hbase.mapred.tablefilters";
> >>> >> >>>
> >>> >> >>>   public void configure(JobConf job) {
> >>> >> >>>     Path[] tableNames = FileInputFormat.getInputPaths(job);
> >>> >> >>>
> >>> >> >>>     String colArg = job.get(COLUMN_LIST);
> >>> >> >>>     String[] colNames = colArg.split(" ");
> >>> >> >>>     byte[][] m_cols = new byte[colNames.length][];
> >>> >> >>>     for (int i = 0; i < m_cols.length; i++) {
> >>> >> >>>       m_cols[i] = Bytes.toBytes(colNames[i]);
> >>> >> >>>     }
> >>> >> >>>     setInputColums(m_cols);
> >>> >> >>>
> >>> >> >>>     ColumnValueFilter filter = new ColumnValueFilter(
> >>> >> >>>         Bytes.toBytes("f1:testcol"),
> >>> >> >>>         ColumnValueFilter.CompareOp.EQUAL,
> >>> >> >>>         Bytes.toBytes("0"));
> >>> >> >>>     setRowFilter(filter);
> >>> >> >>>
> >>> >> >>>     try {
> >>> >> >>>       setHTable(new HTable(new HBaseConfiguration(job),
> >>> >> >>>           tableNames[0].getName()));
> >>> >> >>>     } catch (Exception e) {
> >>> >> >>>       LOG.error(e);
> >>> >> >>>     }
> >>> >> >>>   }
> >>> >> >>> }
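> >>> >> >>>
> >>> >> >>> For reference, a driver using this input format would be wired
> >>> >> >>> up roughly like so (untested sketch; the driver class and the
> >>> >> >>> table name are placeholders):
> >>> >> >>>
> >>> >> >>>   JobConf job = new JobConf(new HBaseConfiguration(),
> >>> >> >>>       MyDriver.class);
> >>> >> >>>   // The table name rides in as the input "path"; configure()
> >>> >> >>>   // above reads it back via FileInputFormat.getInputPaths().
> >>> >> >>>   FileInputFormat.setInputPaths(job, new Path("mytable"));
> >>> >> >>>   job.set(TableInputFormat.COLUMN_LIST, "f1:testcol");
> >>> >> >>>   job.setInputFormat(TableInputFilterFormat.class);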
> >>> >> >>>
> >>> >> >>> However, the M/R job with the RowFilter is much slower than
> >>> >> >>> the M/R job without the RowFilter. During the process many
> >>> >> >>> tasks failed with something like "Task
> >>> >> >>> attempt_200812091733_0063_m_000019_1 failed to report status
> >>> >> >>> for 604 seconds. Killing!". I am wondering if a RowFilter can
> >>> >> >>> really decrease the record feeding from 1 billion to 1
> >>> >> >>> million. If it cannot, is there any other method to address
> >>> >> >>> this issue?
> >>> >> >>>
> >>> >> >>> I am using Hadoop 0.18.2 and HBase 0.18.1.
> >>> >> >>>
> >>> >> >>> Thank you so much in advance!
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>
> >>> >> >>
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >>
> >>> >
> >>> >
> >>>
> >>
> >>
> >
> >
>
