You might look into the API for these packages:

org.apache.hadoop.hbase.regionserver.tableindexed
org.apache.hadoop.hbase.client.tableindexed

http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html

I don't know much about them, as I have never used them, but I think they
allow an index on columns.

Billy

"Naveen Koorakula" wrote in message
news:5b9fff10906100150m5a549d65h3ca440af3a37e2d5@mail.gmail.com...
> That's correct - if you meant "it will have to scan EACH row in that
> column family with at least one non-empty cell".
>
> From http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture:
> "Each column family in a region is managed by an *HStore*. Each HStore may
> have one or more *MapFiles* (a Hadoop HDFS file type) that is very similar
> to a Google *SSTable*. Like SSTables, MapFiles are immutable once closed.
> MapFiles are stored in the Hadoop HDFS."
>
> The way to think of this would be that each column family in the table has
> its own file. The entries in the file look like:
> key:family:label:timestamp value
>
> Since only non-empty table cells are stored in this file, when you're
> scanning, you are only looking at the rows that have non-empty values
> for at least one column label in the column family in question.
>
> For example, assuming a column family "cf", the MapFile for column family
> "cf" might look like:
>
> rowkey1 cf:label1 timestamp1 value1
> rowkey1 cf:label2 timestamp2 value2
> rowkey2 cf:label1 timestamp3 value3
> rowkey4 cf:label3 timestamp4 value4
>
> Even if the scanner is looking for "cf:label2", it will still have to go
> over the entire MapFile to find these entries. That means it still has to
> scan through and discard all the cf:label1 and cf:label3 entries to get to
> the cf:label2 entries.
> (Note that in the above example, rowkey3 did not have a cf:labelX entry,
> therefore the scanner did not have to scan through that row, even if
> rowkey3 did have values for other columns in the table.)
>
> I would recommend reading through the Bigtable paper to understand the
> data model. (Caveat: HBase does deviate slightly from the Bigtable data
> model - no access groups.)
>
> Naveen
>
> On Tue, Jun 9, 2009 at 11:22 PM, Ric Wang wrote:
>
>> Billy,
>>
>> Thank you, it's clearer to me now. But WITHIN the one family where the
>> column-label that needs to be scanned over lives (since I only have one
>> family for the entire table), it will still have to scan EVERY row in
>> that family, no matter if each cell on that column-label has a value or
>> not?
>>
>> -Ric
>>
>>
>> On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson wrote:
>>
>> > It will not scan every row if there is more than one column family,
>> > only the rows that have data for that column.
>> >
>> > You do have parallelism when scanning large tables: the MR job should
>> > be splitting the job into one mapper per region if coded and set up
>> > correctly. New patches in dev set for 0.20 will allow more mappers
>> > per region, speeding this up in some cases.
>> >
>> > Row-based databases can have indexes, but they do not scale well;
>> > indexes require more memory.
>> > HBase is designed to be distributed, parallel, and fault tolerant,
>> > and it scales easily from 1 to hundreds to thousands of servers.
>> >
>> > Billy
>> >
>> >
>> >
>> > "Ric Wang" wrote in message
>> > news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
>> >
>> >> Hi,
>> >>
>> >> Thanks. But if it is still scanning EVERY row in the entire table, how
>> >> does HBase achieve better scan performance, compared to a row-based
>> >> database?
>> >>
>> >> Thanks,
>> >> Ric
>> >>
>> >> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson wrote:
>> >>
>> >>> Without the use of indexes, there is no easy way to get the info
>> >>> without touching every row.
>> >>>
>> >>> So yes you'll be scanning every row. But hbase has good bulk scan
>> >>> perf.
>> >>>
>> >>> On Jun 9, 2009 7:24 PM, "Ric Wang" wrote:
>> >>>
>> >>> How does the scanner know how to get ONLY the "relevant" rows,
>> >>> without a whole table scan?
>> >>>
>> >>> Thanks!
>> >>> Ric
>> >>>
>> >>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula wrote:
>> >>> > The scanner only s...
>> >>> --
>> >>> Ric Wang wqt.work@gmail.com
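
For anyone following along, Naveen's MapFile example quoted above can be
sketched as a toy model in Python. This is not HBase code: Entry,
MAPFILE_CF and scan_column are hypothetical names made up for illustration,
and a plain in-memory list stands in for the on-disk file. It only shows
that a scan for one column label still reads every stored entry in the
family, while rows with no cell in that family (rowkey3) are never seen.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One stored cell, mirroring the 'key family:label timestamp value' layout."""
    row: str
    column: str      # "family:label"
    timestamp: str
    value: str


# Sample contents for column family "cf", copied from the quoted message.
# rowkey3 has no cell in "cf", so it has no entry here at all.
MAPFILE_CF = [
    Entry("rowkey1", "cf:label1", "timestamp1", "value1"),
    Entry("rowkey1", "cf:label2", "timestamp2", "value2"),
    Entry("rowkey2", "cf:label1", "timestamp3", "value3"),
    Entry("rowkey4", "cf:label3", "timestamp4", "value4"),
]


def scan_column(mapfile: List[Entry], column: str) -> Tuple[List[Entry], int]:
    """Scan the whole file, keeping entries for `column`.

    Returns the matching entries and the number of entries read, to show
    that non-matching entries are read and discarded rather than skipped.
    """
    matches = []
    entries_read = 0
    for entry in mapfile:
        entries_read += 1
        if entry.column == column:
            matches.append(entry)
    return matches, entries_read


matches, entries_read = scan_column(MAPFILE_CF, "cf:label2")
print([m.row for m in matches])   # rows that have a cf:label2 cell
print(entries_read)               # entries the scan had to read
```

With the sample data, the scan returns the single cf:label2 cell but reads
all four entries in the family to find it.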