hbase-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Multiple scans vs single scan with filters
Date Thu, 24 Feb 2011 03:53:24 GMT
Hi,


> With a record size of 1k, I'd guesstimate that going with more scans
> is going to be better than one big scan. This is because a scan that
> filters out data still has to read that data from disk, and 1k rows
> are pretty big.
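
For what it's worth, the estimate quoted above can be put into rough numbers.
This is only a back-of-the-envelope sketch using the figures from Alex's
message (10-50 scans of about 100 records each, versus one scan over a range
of 100-200K records, at about 1 KB per record). The helper names are made up
for illustration, and it assumes every record in a scanned range is read and
ignores block and index overhead, which a real cluster would add.

```python
RECORD_SIZE_BYTES = 1024  # "about 1Kb" per record, per the original question


def bytes_read(records: int, record_size: int = RECORD_SIZE_BYTES) -> int:
    """Bytes a scan has to read to cover `records` rows (overhead ignored)."""
    return records * record_size


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


# Multiple scans: 10-50 scans, roughly 100 records each.
multi_low = bytes_read(10 * 100)
multi_high = bytes_read(50 * 100)

# Single filtered scan: the whole 100-200K record range is read,
# and the filter only discards rows after they have been read.
single_low = bytes_read(100_000)
single_high = bytes_read(200_000)

print(f"multiple scans: {to_mib(multi_low):.1f}-{to_mib(multi_high):.1f} MiB")
print(f"single scan:    {to_mib(single_low):.1f}-{to_mib(single_high):.1f} MiB")
print(f"ratio: {single_low / multi_high:.0f}x to {single_high / multi_low:.0f}x")
```

Under those assumptions the single scan reads somewhere between 20 and 200
times more data than the multiple scans combined. Whether that data comes
from disk or from a cache is a separate question.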

Would your answer be different if you (or Alex) knew that the data was actually
being read from either the OS cache or the MemStore?
One can tell whether the disk is doing I/O by using iostat/vmstat, but what
about the MemStore?

Another thought. When you have 1 scan you have one monolithic operation, so to
speak.
But if you have N scans, you could parallelize them somehow. Is this
correct?
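
To make the "N scans in parallel" idea concrete, here is a minimal sketch of
what I have in mind, written in Python with a thread pool. It is not HBase
client code: scan_range is a hypothetical stand-in for whatever call opens a
scanner over one row range, and here it just returns made-up row keys so the
example runs on its own. The names, the row ranges, and the worker count are
all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_range(start_row: str, stop_row: str) -> list:
    """Hypothetical stand-in for one client-side scan over [start_row, stop_row).

    A real implementation would open a scanner against the table here; this
    stub only generates placeholder row keys so the sketch is self-contained.
    """
    return [f"{start_row}-{i:03d}" for i in range(3)]


def parallel_scans(ranges, max_workers=8):
    """Run one scan per (start, stop) range concurrently.

    pool.map returns results in the same order as the input ranges.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: scan_range(*r), ranges))


ranges = [("a", "b"), ("c", "d"), ("e", "f")]
results = parallel_scans(ranges)

for (start, stop), rows in zip(ranges, results):
    print(f"[{start}, {stop}): {len(rows)} rows")
```

Whether this actually helps would depend on how the ranges are spread across
regions and region servers, which this sketch does not model.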

I found https://issues.apache.org/jira/browse/HBASE-1935 which sounds like it
was reviewed, got positive feedback, and went through 3 patch revisions by
stack, but hasn't been committed yet.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

> But nothing will beat hard numbers. Build a test setup and let us know
> which approach works!



> On Wed, Feb 23, 2011 at 2:40 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
> > Hello,
> >
> > It would be great if somebody could share thoughts, ideas, or numbers on
> > the following problem.
> >
> > We have a reporting app. To fetch data for a chart/report we currently
> > use multiple scans, usually 10-50. Each scan fetches about 100 records,
> > which we use to construct the report.
> >
> > I've reviewed the data we store and the code logic, and I see that we
> > could fetch the same data with a single scan by specifying filters that
> > filter out data which doesn't fit the report params. In this case the
> > scan range will be about 100-200K records, from which, after filtering,
> > we'd get the same records we currently fetch with multiple scans.
> >
> > So the question is: given these numbers (10-50 scans fetching 100 records
> > each vs. 1 scan + filters over a range of 100-200K records), will the
> > optimization I have in mind really improve performance? Unfortunately we
> > don't currently have a good volume of data to run tests on. Maybe someone
> > can share thoughts based solely on these numbers? Record size is about
> > 1 KB.
> >
> > Thank you!
> > Alex Baranau
> >
> 
