From: Ric Wang
Date: Wed, 10 Jun 2009 01:22:00 -0500
Subject: Re: scanner on a given column: whole table scan or just the rows that have values
To: hbase-user@hadoop.apache.org

Billy,

Thank you, it's clearer to me now. But WITHIN the one family where the
column label to be scanned lives (I only have one family for the entire
table), will it still have to scan EVERY row in that family, regardless of
whether each cell under that column label has a value?

-Ric

On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson wrote:

> It will not scan every row if there is more than one column family, only the
> rows that have data for that column.
>
> You do have parallelism when scanning large tables: the MR job should be
> splitting the job into one mapper per region if it is coded and set up
> correctly. New patches in dev, set for 0.20, will allow more mappers per
> region, speeding this up in some cases.
>
> Row-based databases can have indexes, but they do not scale well; indexes
> require more memory.
> HBase is designed to be distributed, parallel, and fault tolerant, and it
> scales easily from one to hundreds to thousands of servers.
>
> Billy
>
> "Ric Wang" wrote in message
> news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
>
>> Hi,
>>
>> Thanks. But if it is still scanning EVERY row in the entire table, how does
>> HBase achieve better scan performance, compared to a row-based database?
>>
>> Thanks,
>> Ric
>>
>> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson wrote:
>>
>>> Without the use of indexes, there is no easy way to get the info without
>>> touching every row.
>>>
>>> So yes, you'll be scanning every row. But HBase has good bulk scan perf.
>>>
>>> On Jun 9, 2009 7:24 PM, "Ric Wang" wrote:
>>>
>>> How does the scanner know how to get ONLY the "relevant" rows, without a
>>> whole table scan?
>>>
>>> Thanks!
>>> Ric
>>>
>>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula wrote:
>>> > The scanner only s...
>>>
>>> --
>>> Ric Wang wqt.work@gmail.com
>>
>> --
>> Ric Wang
>> wqt.work@gmail.com

--
Ric Wang
wqt.work@gmail.com
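
To make the point discussed in this thread concrete, here is a minimal toy model in Python. It is not the HBase API, and every name in it (ToyTable, put, scan_column, build_example) is hypothetical. It assumes a simplified picture in which each column family keeps its own store of cells sorted by row and empty cells are never stored, and it ignores regions, indexes, and any other optimization. Under those assumptions, scanning a column that shares a family with dense data walks every stored cell in that family, while the same column placed in its own sparse family only touches the rows that have a value.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a sparse, column-family-oriented table (NOT the HBase API)."""

    def __init__(self):
        # One independent "store" per column family: (row, qualifier) -> value.
        # A cell that was never written simply has no entry.
        self.stores = defaultdict(dict)

    def put(self, row, family, qualifier, value):
        self.stores[family][(row, qualifier)] = value

    def scan_column(self, family, qualifier):
        """Return (matching (row, value) pairs, number of stored cells examined)."""
        examined = 0
        hits = []
        # Only the requested family's store is read; other families are untouched.
        for (row, qual), value in sorted(self.stores[family].items()):
            examined += 1
            if qual == qualifier:
                hits.append((row, value))
        return hits, examined


def build_example():
    """1000 rows with a dense column; every 100th row also carries a flag."""
    table = ToyTable()
    for i in range(1000):
        row = f"row{i:04d}"
        table.put(row, "info", "name", f"user{i}")
        if i % 100 == 0:
            table.put(row, "info", "flag", "x")  # flag in the same family as name
            table.put(row, "rare", "flag", "x")  # flag in a family of its own
    return table


if __name__ == "__main__":
    table = build_example()
    for family in ("info", "rare"):
        hits, examined = table.scan_column(family, "flag")
        print(f"{family}:flag -> {len(hits)} rows returned, {examined} cells examined")
```

In this sketch both scans return the same ten rows; the difference is only in how many stored cells each scan has to walk, which is the trade-off behind putting a sparse column in its own family.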