hbase-user mailing list archives

From Naveen Koorakula <nave...@gmail.com>
Subject Re: scanner on a given column: whole table scan or just the rows that have values
Date Wed, 10 Jun 2009 08:50:23 GMT
That's correct - if you meant "it will have to scan EACH row in that column
family with at least one non-empty cell".

From http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture:
"Each column family in a region is managed by an *HStore*. Each HStore may
have one or more *MapFiles* (a Hadoop HDFS file type) that is very similar
to a Google *SSTable*. Like SSTables, MapFiles are immutable once closed.
MapFiles are stored in the Hadoop HDFS."

One way to think of this is that each column family in the table has
its own file (strictly, one or more MapFiles per region). The entries in
the file look like:
key:family:label:timestamp value

Since only non-empty table cells are stored in this file, a scan only looks
at the rows that have a non-empty value for at least one column label in
the column family in question.

For example, assuming a column family "cf", the MapFile for column family
"cf" might look like:

rowkey1 cf:label1 timestamp1 value1
rowkey1 cf:label2 timestamp2 value2
rowkey2 cf:label1 timestamp3 value3
rowkey4 cf:label3 timestamp4 value4

Even if the scanner is looking only for "cf:label2", it still has to go
over the entire MapFile to find these entries. That means it has to
scan through and discard all the cf:label1 and cf:label3 entries to get to
the cf:label2 entries. (Note that in the above example, rowkey3 has no
cf:labelX entry, so the scanner does not have to scan through that
row, even if rowkey3 has values for other columns in the table.)
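
To make the point above concrete, here is a rough Python sketch of that scan. It is only a toy model, not HBase code: the Cell type, the CF_STORE list, and the scan_column function are names made up for illustration, and the integer timestamps are placeholders. It counts how many entries the scanner has to look at to find the cf:label2 cells.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str      # "family:label"
    timestamp: int   # placeholder integers, not real HBase timestamps
    value: str


# Toy stand-in for the "cf" MapFile from the example, in sorted key order.
# rowkey3 has no cf: cell, so it simply does not appear in this file.
CF_STORE = [
    Cell("rowkey1", "cf:label1", 1, "value1"),
    Cell("rowkey1", "cf:label2", 2, "value2"),
    Cell("rowkey2", "cf:label1", 3, "value3"),
    Cell("rowkey4", "cf:label3", 4, "value4"),
]


def scan_column(store, column):
    """Walk every entry in the store and keep those matching `column`.

    Returns the matching cells and the number of entries examined.
    """
    examined = 0
    matches = []
    for cell in store:
        examined += 1
        if cell.column == column:
            matches.append(cell)
    return matches, examined


matches, examined = scan_column(CF_STORE, "cf:label2")
print([c.row for c in matches])  # ['rowkey1']
print(examined)                  # 4
```

All four entries are examined to return one cell, but rowkey3 costs nothing because it was never stored in this family's file.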

I would recommend reading through the Bigtable paper to understand the data
model. (Caveat: HBase does deviate slightly from the Bigtable data model -
no locality groups.)

Naveen

On Tue, Jun 9, 2009 at 11:22 PM, Ric Wang <wqt.work@gmail.com> wrote:

> Billy,
>
> Thank you, it's clearer to me now. But WITHIN the one family where the
> column-label that needs to be scanned over lives (since I only have one
> family for the entire table), it will still have to scan EVERY row in that
> family no matter if each cell on that column-label has value or not?
>
> -Ric
>
>
> On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson
> <sales@pearsonwholesale.com>wrote:
>
> > It will not scan every row if there is more than one column family, only
> > the rows that have data for that column.
> >
> > You do have parallelism when scanning large tables: the MR job should be
> > splitting the job into one mapper per region
> > if set up correctly. New patches in dev set for 0.20 will allow more
> > mappers per region, speeding this up in some cases.
> >
> > Row-based databases can have indexes, but they do not scale well; indexes
> > require more memory.
> > HBase is designed to be distributed, parallel, and fault tolerant, and it
> > scales easily from one to hundreds to thousands of servers.
> >
> > Billy
> >
> >
> >
> > "Ric Wang" <wqt.work@gmail.com> wrote in message
> > news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
> >
> >  Hi,
> >>
> >> Thanks. But if it is still scanning EVERY row in the entire table, how
> >> does HBase achieve better scan performance, compared to a row-based
> >> database?
> >>
> >> Thanks,
> >> Ric
> >>
> >>
> >>
> >> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >>
> >>  Without the use of indexes, there is no easy way to get the info
> >>> without touching every row.
> >>>
> >>> So yes you'll be scanning every row.  But hbase has good bulk scan
> >>> perf.
> >>>
> >>> On Jun 9, 2009 7:24 PM, "Ric Wang" <wqt.work@gmail.com> wrote:
> >>>
> >>> How does the scanner know how to get ONLY the "relevant" rows, without
> >>> a whole table scan?
> >>>
> >>> Thanks!
> >>> Ric
> >>>
> >>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <naveenk@gmail.com>
> >>> wrote:
> >>> > The scanner only s...
> >>> --
> >>>
> >>> Ric Wang wqt.work@gmail.com
> >>>
> >>>
> >>
> >>
> >> --
> >> Ric Wang
> >> wqt.work@gmail.com
> >>
> >>
> >
> >
>
>
> --
> Ric Wang
> wqt.work@gmail.com
>
