From: Constantine Peresypkin
To: drill-dev@incubator.apache.org
Date: Wed, 19 Sep 2012 04:30:21 +0300
Subject: Re: In-place processing and performance.

1. I don't see why the cache should be in columnar format. The only purpose
   of the Dremel columnar format is to accelerate full table scans. That's it.
2. Scanners will be in C for performance reasons. The Dremel idea = scan
   performance.

On Wed, Sep 19, 2012 at 12:58 AM, moon soo Lee wrote:

> I agree: working version first, optimization later.
>
> Is there a good reason that many input scanners are expected to be in C?
>
> On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning wrote:
>
> > I also generally agree, but I really think that we need a bit of
> > experience with a simple working version of Drill first.
> >
> > Also, anything like this is going to have to recognize that there are
> > likely to be multiple columnar formats and that some (many) input
> > scanners are going to be coded in C, not just Java.
> >
> > On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu wrote:
> >
> > > Thanks!
> > >
> > > Generally agree, but cache and data manipulation should be separated.
> > > Every query reaches the cache first; on a miss, it calls the read-data
> > > interface, which should not be part of the cache module.
> > >
> > > That way everybody can replace the cache policy and the data
> > > read/write implementations, and then configure
> > > drill.cache.policy.class, drill.read.class and drill.write.class
> > > in the configuration file.
> > >
> > > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee wrote:
> > >
> > > > Here's my quick proposal for a common caching framework for Drill.
> > > >
> > > > 0. Why
> > > >
> > > >    - With in-place processing, the data format is not guaranteed to
> > > >      be the most efficient format to process (i.e. columnar).
> > > >    - A non-columnar format can have a huge performance impact
> > > >      (an order of magnitude).
> > > >
> > > > 1. Goal
> > > >
> > > >    - Increase performance without painful ETL.
> > > >    - Performance includes not only overall throughput but also how
> > > >      interactive it is.
> > > >    - Provide an interface that is easy to implement from the
> > > >      datasource's point of view.
> > > >
> > > > 2. How does it look?
> > > >
> > > >    - Drill provides a common caching policy, which is responsible for:
> > > >
> > > >       - constructing the columnar format
> > > >       - reading the columnar format
> > > >       - the caching algorithm
> > > >
> > > >    - Each datasource optionally implements some methods to support
> > > >      caching. They could be:
> > > >
> > > >      interface CachingSupport {
> > > >
> > > >        // to write columnar format data to cache media
> > > >        OutputStream getOutputStream(String path);
> > > >
> > > >        // to clear cached data
> > > >        void remove(String path);
> > > >
> > > >        // to read cached data
> > > >        InputStream getInputStream(String path);
> > > >
> > > >        // to get location information of data (in DFS)
> > > >        Location getLocation(String path);
> > > >
> > > >      }
> > > >
> > > >    - The datasource implementation does not care about the columnar
> > > >      format, the cache replacement policy, and so on; it only cares
> > > >      about basic IO. So people who implement a datasource do not
> > > >      need to understand the columnar details.
> > > >
> > > > 3. How does it work?
> > > >
> > > >    - Drill constructs the columnar format cache using the methods
> > > >      provided by the datasource.
> > > >    - A datasource can skip the caching implementation. In that
> > > >      case, Drill works in passthrough mode.
> > > >    - The cache policy class can be replaced. So if there is a more
> > > >      efficient data format or a more efficient algorithm, it can be
> > > >      applied without changing every datasource implementation.
> > > >    - Cache construction does not block data reads, so the
> > > >      performance impact of cache construction is minimized.
> > > >    - Drill performs its queries through the cache. There could be
> > > >      some queries for cache management (like purge).
> > > >
> > > > Is it worth it, or does it just add complexity?
> > > >
> > > > For me, it's worth it: +1.
> > > >
> > > > And I'm fully ready to do this job. :-)
> > > >
> > > > Thanks.
> > > >
> > > > ----
> > > >
> > > > Leemoonsoo
> > > > moon@nflabs.com
> > > >
> > > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran wrote:
> > > >
> > > > > The plan was to have the scan operator do that kind of caching,
> > > > > but I agree it could make sense to have some common caching
> > > > > framework in case other scan operators want to cache as well.
> > > > >
> > > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee wrote:
> > > > >
> > > > > > Drill wants in-place processing ([1], page 12). Yes, ETL is
> > > > > > painful. In my understanding, in-place processing means the
> > > > > > data is not always columnar.
> > > > > >
> > > > > > [2], Figure 10, shows the performance difference between
> > > > > > columnar and record-oriented (MR) storage. If Dremel works
> > > > > > with record-oriented data, I would guess it'll be an order of
> > > > > > magnitude slower.
> > > > > >
> > > > > > If that's true, will it still be interactive?
> > > > > >
> > > > > > And can anyone give more detail about "Adaptively convert
> > > > > > storage layout into more efficient forms" ([1], page 12)?
> > > > > > Is it a kind of transparent columnar format caching?
> > > > > >
> > > > > > And if non-columnar data is expected in many cases, then how
> > > > > > about Drill having a common cache for the storage interface,
> > > > > > instead of each scanner implementing its own caching policy?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > [1] Apache Drill, Architecture outlines.
> > > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > >
> > > > > --
> > > > > Tomer Shiran
> > > > > Director of Product Management | MapR Technologies | 650-804-8657
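
For readers of the archive, the division of labour in the quoted proposal (the datasource supplies basic IO, Drill supplies the policy) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Drill or from anyone in the thread. The String path parameters and the checked IOException are assumed. getLocation is left out because the thread never defines Location. SourceReader, InMemoryCachingSupport and ReadThroughCache are hypothetical names. The sketch also simplifies two points of the proposal: it stores the raw bytes instead of converting them to a columnar layout, and it builds the cache entry synchronously rather than off the read path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** The proposal's interface with assumed parameter types; getLocation is omitted. */
interface CachingSupport {
    OutputStream getOutputStream(String path) throws IOException;
    void remove(String path) throws IOException;
    InputStream getInputStream(String path) throws IOException;
}

/** Hypothetical stand-in for a datasource's ordinary, uncached read path. */
interface SourceReader {
    byte[] read(String path) throws IOException;
}

/** Toy datasource-side implementation: basic IO only, backed by a map. */
final class InMemoryCachingSupport implements CachingSupport {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    @Override
    public OutputStream getOutputStream(final String path) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                store.put(path, toByteArray()); // entry becomes visible on close
            }
        };
    }

    @Override
    public void remove(String path) {
        store.remove(path);
    }

    @Override
    public InputStream getInputStream(String path) throws IOException {
        byte[] data = store.get(path);
        if (data == null) {
            throw new FileNotFoundException(path);
        }
        return new ByteArrayInputStream(data);
    }
}

/** Toy Drill-side policy: read-through, no eviction, no columnar conversion. */
final class ReadThroughCache {
    private final CachingSupport support; // null means passthrough mode
    private final SourceReader source;
    private final Set<String> cached = ConcurrentHashMap.newKeySet();

    ReadThroughCache(CachingSupport support, SourceReader source) {
        this.support = support;
        this.source = source;
    }

    byte[] read(String path) throws IOException {
        if (support == null) {
            return source.read(path); // datasource skipped caching support
        }
        if (cached.contains(path)) {
            try (InputStream in = support.getInputStream(path)) {
                return readAll(in); // cache hit
            }
        }
        byte[] raw = source.read(path); // cache miss: go to the datasource
        // A real policy would convert to a columnar layout here, asynchronously.
        try (OutputStream out = support.getOutputStream(path)) {
            out.write(raw);
        }
        cached.add(path);
        return raw;
    }

    /** Cache-management operation, like the "purge" query mentioned in the thread. */
    void purge(String path) throws IOException {
        if (support != null && cached.remove(path)) {
            support.remove(path);
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

Constructing ReadThroughCache with a null CachingSupport gives the passthrough mode described in the proposal: every read goes straight to the datasource. Swapping in a different policy class leaves InMemoryCachingSupport untouched, which is the separation the proposal argues for.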