hbase-user mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: scanner on a given column: whole table scan or just the rows that have values
Date Wed, 10 Jun 2009 20:03:58 GMT
You might look into the API for these packages:
org.apache.hadoop.hbase.regionserver.tableindexed
org.apache.hadoop.hbase.client.tableindexed
http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html

I don't know much about them and have never used them, but I think they allow
an index on columns.
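
A toy sketch of why an index helps, in plain Python. This is not the
tableindexed API; STORE_FILE, scan_column, build_index and indexed_lookup are
made-up names, and the entries are the sample rows from the quoted message
below. A scan for one column label has to step over every stored entry in the
family's file, while a lookup through a (hypothetical) column-to-position
index touches only the matching entries:

```python
from collections import defaultdict

# Toy model of the store file for column family "cf": sorted
# (row key, column, timestamp, value) entries. Empty cells are not stored,
# so rowkey3 does not appear at all.
STORE_FILE = [
    ("rowkey1", "cf:label1", 1, "value1"),
    ("rowkey1", "cf:label2", 2, "value2"),
    ("rowkey2", "cf:label1", 3, "value3"),
    ("rowkey4", "cf:label3", 4, "value4"),
]


def scan_column(entries, column):
    """Full scan: visit every stored entry, keep only the wanted column."""
    visited = 0
    hits = []
    for row, col, _ts, value in entries:
        visited += 1
        if col == column:
            hits.append((row, value))
    return hits, visited


def build_index(entries):
    """Hypothetical secondary index: column -> positions of its entries."""
    index = defaultdict(list)
    for pos, (_row, col, _ts, _value) in enumerate(entries):
        index[col].append(pos)
    return index


def indexed_lookup(entries, index, column):
    """Indexed read: touch only the entries the index points at."""
    positions = index.get(column, [])
    hits = [(entries[p][0], entries[p][3]) for p in positions]
    return hits, len(positions)


if __name__ == "__main__":
    print(scan_column(STORE_FILE, "cf:label2"))
    idx = build_index(STORE_FILE)
    print(indexed_lookup(STORE_FILE, idx, "cf:label2"))
```

With these four sample entries the scan visits all four to find the one
cf:label2 cell, and the indexed lookup touches one. The cost is the extra
memory and upkeep for the index.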

Billy


"Naveen Koorakula" <naveenk@gmail.com> wrote 
in message 
news:5b9fff10906100150m5a549d65h3ca440af3a37e2d5@mail.gmail.com...
> That's correct - if you meant "it will have to scan EACH row in that column
> family with at least one non-empty cell".
>
> From http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture:
> "Each column family in a region is managed by an *HStore*. Each HStore may
> have one or more *MapFiles* (a Hadoop HDFS file type) that is very similar
> to a Google *SSTable*. Like SSTables, MapFiles are immutable once closed.
> MapFiles are stored in the Hadoop HDFS."
>
> The way to think of this would be that each column family in the table has
> its own file. The entries in the file look like:
> key:family:label:timestamp value
>
> Since only non-empty table cells are stored in this file, when you're
> scanning, you are only looking at the rows that have non-empty values
> for at least one column label in the column family in question.
>
> For example, assuming a column family "cf", the MapFile for column family "cf"
> might look like:
>
> rowkey1 cf:label1 timestamp1 value1
> rowkey1 cf:label2 timestamp2 value2
> rowkey2 cf:label1 timestamp3 value3
> rowkey4 cf:label3 timestamp4 value4
>
> Even if the scanner is looking for "cf:label2", it will still have to go
> over the entire MapFile to find these entries. That means it still has to
> scan through and discard all the cf:label1 and cf:label3 entries to get to
> the cf:label2 entries. (Note that in the above example, rowkey3 did not have
> a cf:labelX entry, therefore the scanner did not have to scan through that
> row, even if rowkey3 did have values for other columns in the table.)
>
> I would recommend reading through the Bigtable paper to understand the data
> model. (Caveat: HBase does deviate slightly from the Bigtable data model -
> no access groups.)
>
> Naveen
>
> On Tue, Jun 9, 2009 at 11:22 PM, Ric Wang <wqt.work@gmail.com> wrote:
>
>> Billy,
>>
>> Thank you, it's clearer to me now. But WITHIN the one family where the
>> column-label that needs to be scanned over lives (since I only have one
>> family for the entire table), it will still have to scan EVERY row in that
>> family, no matter whether each cell on that column-label has a value or not?
>>
>> -Ric
>>
>>
>> On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson <sales@pearsonwholesale.com> wrote:
>>
>> > It will not scan every row if there is more than one column family, only the
>> > rows that have data for that column.
>> >
>> > You do have parallelism when scanning large tables: the MR job should be
>> > splitting the job into one mapper per region
>> > if it is set up correctly. New patches in dev, set for 0.20, will allow more
>> > mappers per region, speeding this up in some cases.
>> >
>> > Row-based databases can have indexes, but they do not scale well; indexes
>> > require more memory.
>> > HBase is designed to be distributed, parallel, and fault tolerant, and it
>> > scales easily from one to hundreds to thousands of servers.
>> >
>> > Billy
>> >
>> >
>> >
>> > "Ric Wang" <wqt.work@gmail.com> wrote in message
>> > news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
>> >
>> >> Hi,
>> >>
>> >> Thanks. But if it is still scanning EVERY row in the entire table, how does
>> >> HBase achieve better scan performance, compared to a row-based database?
>> >>
>> >> Thanks,
>> >> Ric
>> >>
>> >>
>> >>
>> >> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> >>
>> >>> Without the use of indexes, there is no easy way to get the info without
>> >>> touching every row.
>> >>>
>> >>> So yes, you'll be scanning every row. But HBase has good bulk scan perf.
>> >>>
>> >>> On Jun 9, 2009 7:24 PM, "Ric Wang" <wqt.work@gmail.com> wrote:
>> >>>
>> >>> How does the scanner know how to get ONLY the "relevant" rows, without a
>> >>> whole table scan?
>> >>>
>> >>> Thanks!
>> >>> Ric
>> >>>
>> >>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <naveenk@gmail.com> wrote:
>> >>> > The scanner only s...
>> >>> --
>> >>>
>> >>> Ric Wang wqt.work@gmail.com
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Ric Wang
>> >> wqt.work@gmail.com
>> >>
>> >>
>> >
>> >
>>
>>
>> --
>> Ric Wang
>> wqt.work@gmail.com
>>
> 


