drill-dev mailing list archives

From moon soo Lee <leemoon...@gmail.com>
Subject Re: In-place processing and performance.
Date Tue, 18 Sep 2012 21:58:36 GMT
I agree: working version first, optimization later.

Is there a good reason that many input scanners are expected to be written in C?



On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I also generally agree, but I really think that we need a bit of experience
> with a simple working version of Drill first.
>
> Also, anything like this is going to have to recognize that there are
> likely to be multiple columnar formats and that some (many) input scanners
> are going to be coded in C, not just Java.
>
> On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>
> > Thanks!
> >
> > Generally agree, but cache and data manipulation should be separated. Every
> > query reaches the cache first; on a miss, it then calls the data-read interface,
> > which should not be part of the cache module.
> >
> > That way everybody can replace the cache policy and the data read/write
> > implementations, and then configure drill.cache.policy.class, drill.read.class,
> > and drill.write.class in the configuration file.
> >
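
The configuration idea above (drill.cache.policy.class and friends) can be illustrated with a minimal Java sketch. It assumes the property value is a fully qualified class name with a no-arg constructor; CachePolicy, LruCachePolicy and load are hypothetical names made up for this example and are not Drill types.

```java
import java.util.Properties;

public class PluggableConfigSketch {
    /** Hypothetical policy interface; not an actual Drill type. */
    public interface CachePolicy {
        String name();
    }

    /** Hypothetical policy implementation used only for illustration. */
    public static class LruCachePolicy implements CachePolicy {
        @Override
        public String name() {
            return "lru";
        }
    }

    /** Instantiates the class named under the given property key via its no-arg constructor. */
    public static <T> T load(Properties props, String key, Class<T> type)
            throws ReflectiveOperationException {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
        return type.cast(instance);
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In a real setup this value would come from the configuration file.
        props.setProperty("drill.cache.policy.class", LruCachePolicy.class.getName());
        CachePolicy policy = load(props, "drill.cache.policy.class", CachePolicy.class);
        System.out.println(policy.name());
    }
}
```

The same load call would serve drill.read.class and drill.write.class, each with its own interface type, so swapping an implementation is a one-line configuration change.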
> >
> > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee <leemoonsoo@gmail.com>
> > wrote:
> >
> > > Here's my quick proposal for a common caching framework for Drill.
> > >
> > > 0. Why
> > >
> > >    - With in-place processing, the data is not guaranteed to be in the most
> > >    efficient format for processing (i.e. columnar).
> > >    - A non-columnar format can have a huge performance impact (an order of
> > >    magnitude).
> > >
> > >
> > > 1. Goal.
> > >
> > >    - Increase performance without painful ETL.
> > >    - Performance includes not only overall throughput but also how
> > >    interactive it is.
> > >    - Provide an interface that is easy to implement from the datasource's
> > >    point of view.
> > >
> > >
> > > 2. How does it look?
> > >
> > >    - Drill provides a common caching policy, which is responsible for
> > >
> > >       - constructing the columnar format
> > >       - reading the columnar format
> > >       - the caching algorithm
> > >
> > >
> > >    - Each datasource optionally implements some methods to support caching;
> > >    they could be
> > >
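> > >    import java.io.IOException;
> > >    import java.io.InputStream;
> > >    import java.io.OutputStream;
> > >
> > >    // path is assumed to be a String; Location is not defined in this proposal
> > >    interface CachingSupport {
> > >
> > >    // to write columnar format data to cache media
> > >    OutputStream getOutputStream(String path) throws IOException;
> > >
> > >    // to clear cached data
> > >    void remove(String path) throws IOException;
> > >
> > >    // to read cached data
> > >    InputStream getInputStream(String path) throws IOException;
> > >
> > >    // to get location information of data (in DFS)
> > >    Location getLocation(String path);
> > >
> > >    }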
> > >
> > >    - The datasource implementation does not care about the columnar format,
> > >    the cache replacement policy, and so on; it only cares about basic IO. So
> > >    people who implement a datasource do not need to understand the columnar
> > >    details.
> > >
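
To make the "only basic IO" point concrete, here is a minimal Java sketch of a datasource that backs the proposed interface with a local directory. The String path type, the IOException declarations, the shape of Location, and the class LocalDirCachingSupport are all assumptions for illustration; the thread does not define them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalCachingSupportSketch {
    /** Assumed shape of Location; the proposal does not define this type. */
    public static final class Location {
        public final String host;

        public Location(String host) {
            this.host = host;
        }
    }

    /** The proposed interface, with String paths and IOException as assumptions. */
    public interface CachingSupport {
        OutputStream getOutputStream(String path) throws IOException;

        void remove(String path) throws IOException;

        InputStream getInputStream(String path) throws IOException;

        Location getLocation(String path);
    }

    /** Hypothetical datasource that keeps cache files under a local directory. */
    public static class LocalDirCachingSupport implements CachingSupport {
        private final Path root;

        public LocalDirCachingSupport(Path root) {
            this.root = root;
        }

        @Override
        public OutputStream getOutputStream(String path) throws IOException {
            Path target = root.resolve(path);
            Files.createDirectories(target.getParent());
            return Files.newOutputStream(target);
        }

        @Override
        public void remove(String path) throws IOException {
            Files.deleteIfExists(root.resolve(path));
        }

        @Override
        public InputStream getInputStream(String path) throws IOException {
            return Files.newInputStream(root.resolve(path));
        }

        @Override
        public Location getLocation(String path) {
            // Everything lives on the local machine in this sketch.
            return new Location("localhost");
        }
    }

    public static void main(String[] args) throws IOException {
        CachingSupport cs = new LocalDirCachingSupport(Files.createTempDirectory("drill-cache"));
        try (OutputStream out = cs.getOutputStream("t1/col0")) {
            out.write(new byte[] {1, 2, 3});
        }
        try (InputStream in = cs.getInputStream("t1/col0")) {
            System.out.println(in.readAllBytes().length);
        }
        cs.remove("t1/col0");
    }
}
```

Nothing in the implementation knows about columnar encoding or eviction; it only opens, reads, and deletes byte streams, which is the division of labor the proposal describes.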
> > >
> > > 3. How does it work?
> > >
> > >    - Drill constructs the columnar format cache using the methods provided
> > >    by the datasource.
> > >    - A datasource can skip the caching implementation. In that case,
> > >    Drill works in passthrough mode.
> > >    - The cache policy class can be replaced. So if a more efficient data
> > >    format or algorithm comes along, it can be applied without changing
> > >    every datasource implementation.
> > >    - Cache construction does not block data reads, so the performance
> > >    impact of cache construction is minimized.
> > >    - Drill performs its queries through the cache. There could be some
> > >    queries for cache management (like purge).
> > >
> > >
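
The flow in section 3 can be sketched roughly as follows in Java. This is only an illustration under simplifying assumptions: data is modeled as plain strings, the cache is an in-memory map rather than the datasource-provided streams, and the names ReadThroughSketch, rawReader, toColumnar and awaitBuild are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ReadThroughSketch {
    private final Function<String, String> rawReader;   // reads the datasource's native format
    private final Function<String, String> toColumnar;  // stand-in for columnar conversion
    private final boolean cachingSupported;             // false means passthrough mode
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, CompletableFuture<Void>> builds = new ConcurrentHashMap<>();

    public ReadThroughSketch(Function<String, String> rawReader,
                             Function<String, String> toColumnar,
                             boolean cachingSupported) {
        this.rawReader = rawReader;
        this.toColumnar = toColumnar;
        this.cachingSupported = cachingSupported;
    }

    /** Serves from the cache when possible; otherwise reads raw data and builds the cache in the background. */
    public String read(String path) {
        if (!cachingSupported) {
            return rawReader.apply(path); // passthrough: the datasource skipped caching
        }
        String cached = cache.get(path);
        if (cached != null) {
            return cached;
        }
        String raw = rawReader.apply(path);
        // Start at most one background build per path; the caller does not wait for it.
        builds.computeIfAbsent(path,
                p -> CompletableFuture.runAsync(() -> cache.put(p, toColumnar.apply(raw))));
        return raw;
    }

    /** Cache management operation, like the purge query mentioned in the proposal. */
    public void purge(String path) {
        cache.remove(path);
        builds.remove(path);
    }

    /** Blocks until the background build for this path, if any, has finished. */
    public void awaitBuild(String path) {
        CompletableFuture<Void> build = builds.get(path);
        if (build != null) {
            build.join();
        }
    }

    public static void main(String[] args) {
        ReadThroughSketch reader = new ReadThroughSketch(p -> "row:" + p, s -> "col:" + s, true);
        System.out.println(reader.read("t1")); // first read is served from the raw source
        reader.awaitBuild("t1");
        System.out.println(reader.read("t1")); // later reads are served from the cache
    }
}
```

The sketch ignores races between purge and an in-flight build, and it keeps the converted data in memory; a fuller version would write through the datasource's output stream and let the replaceable cache policy decide what to evict.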
> > >
> > > Is it worth it, or does it just add complexity?
> > >
> > > For me, it is worth it: +1.
> > >
> > > And I'm fully ready to do this job. :-)
> > >
> > >
> > > Thanks.
> > >
> > > ----
> > >
> > > Leemoonsoo
> > > moon@nflabs.com
> > >
> > >
> > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran <tshiran@maprtech.com>
> > > wrote:
> > >
> > > > The plan was to have the scan operator do that kind of caching, but I
> > > > agree it could make sense to have some common caching framework in case
> > > > other scan operators want to cache as well.
> > > >
> > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee <moon@nflabs.com> wrote:
> > > >
> > > > > Drill wants in-place processing ([1], page 12). Yes, ETL is painful.
> > > > > In my understanding, in-place processing means the data is not always
> > > > > columnar.
> > > > >
> > > > > [2], Figure 10, shows the performance difference between columnar and
> > > > > record-oriented (MR).
> > > > > If Dremel works with record-oriented data, I would guess it will be an
> > > > > order of magnitude slower.
> > > > >
> > > > > If that's true, will this still be interactive?
> > > > >
> > > > > And can anyone give more detail about "Adaptively convert storage
> > > > > layout into more efficient forms" ([1], page 12)?
> > > > > Is it a kind of transparent columnar format caching?
> > > > >
> > > > > And if non-columnar data is expected in many cases,
> > > > > then how about Drill having a common cache for the storage interface
> > > > > instead of each scanner implementing its own caching policy?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > [1] Apache Drill, Architecture outlines.
> > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Tomer Shiran
> > > > Director of Product Management | MapR Technologies | 650-804-8657
> > > >
> > >
> >
>
