drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: In-place processing and performance.
Date Wed, 19 Sep 2012 04:35:22 GMT
On Tue, Sep 18, 2012 at 6:30 PM, Constantine Peresypkin <
pconstantine@gmail.com> wrote:

> 1. I don't see why cache should be in columnar format. The only purpose of
> Dremel columnar format is to accelerate full table scans. That's it.
>

The cache is to make things fast.

Columnar cache will make the next query fast.
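
As a rough illustration of that point, here is a minimal sketch (not Drill
code; the Row and Columns types and the field names are made up for the
example). Once a columnar copy has been built, a query that needs a single
field reads one array instead of walking every full record.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnarScanSketch {
    /** Row-oriented record: all fields of one record sit together. (Hypothetical type.) */
    static final class Row {
        final long id;
        final String url;
        final long bytes;

        Row(long id, String url, long bytes) {
            this.id = id;
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Columnar copy of the same data: one array per field. (Hypothetical type.) */
    static final class Columns {
        final long[] id;
        final String[] url;
        final long[] bytes;

        Columns(List<Row> rows) {
            int n = rows.size();
            id = new long[n];
            url = new String[n];
            bytes = new long[n];
            for (int i = 0; i < n; i++) {
                Row r = rows.get(i);
                id[i] = r.id;
                url[i] = r.url;
                bytes[i] = r.bytes;
            }
        }
    }

    /** Walks whole records even though only one field is needed. */
    static long sumBytesRowOriented(List<Row> rows) {
        long total = 0;
        for (Row r : rows) {
            total += r.bytes;
        }
        return total;
    }

    /** Touches only the one column the query asks for. */
    static long sumBytesColumnar(Columns c) {
        long total = 0;
        for (long b : c.bytes) {
            total += b;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, "/a", 10));
        rows.add(new Row(2, "/b", 20));
        rows.add(new Row(3, "/c", 30));

        // Built once, for example as a side effect of the first query.
        Columns cached = new Columns(rows);

        System.out.println(sumBytesRowOriented(rows)); // 60
        System.out.println(sumBytesColumnar(cached));  // 60
    }
}
```

Both scans return the same answer; the difference is only in how much data
each one has to touch, which is what matters for scans over wide records.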


> 2. Scanners will be in C for performance reasons. Dremel idea = scan
> performance.
>

Scanners will be in whatever language the authors write them in.  I think
we need to preserve the option to write them in whatever language fits.
Some serializations only have bindings in, say, Java.



>
> On Wed, Sep 19, 2012 at 12:58 AM, moon soo Lee <leemoonsoo@gmail.com>
> wrote:
>
> > I agree: working version first, and optimization later.
> >
> > Is there a good reason that many input scanners are expected to be in C?
> >
> >
> >
> > On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > I also generally agree, but I really think that we need a bit of
> > > experience with a simple working version of Drill first.
> > >
> > > Also, anything like this is going to have to recognize that there are
> > > likely to be multiple columnar formats and that some (many) input
> > > scanners are going to be coded in C, not just Java.
> > >
> > > On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> > >
> > > > Thanks!
> > > >
> > > > Generally agree, but cache and data manipulation should be separated.
> > > > Every query reaches the cache first; if it does not hit, it then calls
> > > > the read-data interface, which should not be included in the cache
> > > > module.
> > > >
> > > > That way everybody can replace the cache policy and the read/write
> > > > code, and then configure drill.cache.policy.class, drill.read.class,
> > > > and drill.write.class in the configuration file.
> > > >
> > > >
> > > > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee <leemoonsoo@gmail.com>
> > > > wrote:
> > > >
> > > > > Here's my quick proposal for a common caching framework for Drill.
> > > > >
> > > > > 0. Why
> > > > >
> > > > >    - With in-place processing, the data is not guaranteed to be in
> > > > >    the most efficient format for processing (i.e. columnar).
> > > > >    - A non-columnar format can have a huge performance impact (an
> > > > >    order of magnitude).
> > > > >
> > > > >
> > > > > 1. Goal.
> > > > >
> > > > >    - Increase performance without painful ETL
> > > > >    - Performance includes not only overall throughput but also how
> > > > >    interactive it is.
> > > > >    - Provide an interface that is easy to implement from the
> > > > >    datasource's point of view
> > > > >
> > > > >
> > > > > 2. How does it look?
> > > > >
> > > > >    - Drill provides a common caching policy, which is responsible for
> > > > >
> > > > >    - constructing the columnar format
> > > > >    - reading the columnar format
> > > > >    - the caching algorithm
> > > > >
> > > > >
> > > > >    - Each datasource optionally implements some methods to support
> > > > >    caching. They could be
> > > > >
> > > > >    interface CachingSupport {
> > > > >        // to write columnar format data to cache media
> > > > >        OutputStream getOutputStream(path);
> > > > >
> > > > >        // to clear cached data
> > > > >        void remove(path);
> > > > >
> > > > >        // to read cached data
> > > > >        InputStream getInputStream(path);
> > > > >
> > > > >        // to get location information of data (in DFS)
> > > > >        Location getLocation(path);
> > > > >    }
> > > > >
> > > > >    - The datasource implementation does not need to care about the
> > > > >    columnar format, the cache replacement policy, and so on; it only
> > > > >    cares about basic IO. So people who implement a datasource do not
> > > > >    need to understand the columnar details.
> > > > >
> > > > >
> > > > > 3. How does it work?
> > > > >
> > > > >    - Drill constructs the columnar-format cache using the methods
> > > > >    the datasource provides.
> > > > >    - A datasource can skip implementing the caching methods. In that
> > > > >    case, Drill works in passthrough mode.
> > > > >    - The cache policy class can be replaced. So if there is a more
> > > > >    efficient data format or algorithm, it can be applied without
> > > > >    changing every datasource implementation.
> > > > >    - Cache construction does not block data reads, so the
> > > > >    performance impact of cache construction is minimized.
> > > > >    - Drill performs its queries through the cache. There could be
> > > > >    some queries for cache management (like purge).
> > > > >
> > > > >
> > > > >
> > > > > Is it worth it, or does it just add complexity?
> > > > >
> > > > > For me, it is worth it: +1.
> > > > >
> > > > > And I'm fully ready to do this job. :-)
> > > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > > ----
> > > > >
> > > > > Leemoonsoo
> > > > > moon@nflabs.com
> > > > >
> > > > >
> > > > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran <tshiran@maprtech.com>
> > > > > wrote:
> > > > >
> > > > > > The plan was to have the scan operator do that kind of caching, but
> > > > > > I agree it could make sense to have some common caching framework in
> > > > > > case other scan operators want to cache as well.
> > > > > >
> > > > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee <moon@nflabs.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Drill wants in-place processing ([1], page 12). Yes, ETL is painful.
> > > > > > > In my understanding, in-place processing means the data is not
> > > > > > > always columnar.
> > > > > > >
> > > > > > > [2], Figure 10, shows the performance difference between columnar
> > > > > > > and record-oriented (MR) storage. If Dremel works with
> > > > > > > record-oriented data, I would guess that it will be an order of
> > > > > > > magnitude slower.
> > > > > > >
> > > > > > > If that's true, will this still be interactive?
> > > > > > >
> > > > > > > And can anyone give more detail about "Adaptively convert storage
> > > > > > > layout into more efficient forms", [1], page 12?
> > > > > > > Is it a kind of transparent columnar-format caching?
> > > > > > >
> > > > > > > And if non-columnar data is expected in many cases, then how about
> > > > > > > Drill having a common cache for the storage interface, instead of
> > > > > > > each scanner implementing its own caching policy?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > [1] Apache Drill, Architecture outlines.
> > > > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Tomer Shiran
> > > > > > Director of Product Management | MapR Technologies |
> 650-804-8657
> > > > > >
> > > > >
> > > >
> > >
> >
>
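
For readers following the thread, the quoted CachingSupport proposal can be
sketched roughly as below. This is only an illustration under several
assumptions, not Drill code: the path argument is taken to be a String,
getLocation is left out because the Location type is not defined in the
thread, and SourceReader, InMemoryCachingSupport, and the read helper are
hypothetical names. The sketch also simplifies the proposal: it caches raw
bytes rather than a columnar encoding, and it fills the cache synchronously
rather than in the background.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CachingSketch {
    /** Subset of the proposed interface; String paths are an assumption. */
    interface CachingSupport {
        OutputStream getOutputStream(String path) throws IOException;

        /** Returns null when nothing is cached for the path (an assumption). */
        InputStream getInputStream(String path) throws IOException;

        void remove(String path) throws IOException;
    }

    /** Hypothetical datasource-side implementation that keeps cached bytes in memory. */
    static final class InMemoryCachingSupport implements CachingSupport {
        private final Map<String, byte[]> store = new HashMap<>();

        @Override
        public OutputStream getOutputStream(final String path) {
            return new ByteArrayOutputStream() {
                @Override
                public void close() throws IOException {
                    super.close();
                    store.put(path, toByteArray()); // publish the entry on close
                }
            };
        }

        @Override
        public InputStream getInputStream(String path) {
            byte[] data = store.get(path);
            return data == null ? null : new ByteArrayInputStream(data);
        }

        @Override
        public void remove(String path) {
            store.remove(path);
        }
    }

    /** Hypothetical reader for the original, non-cached data. */
    interface SourceReader {
        byte[] read(String path) throws IOException;
    }

    /** Read-through lookup; a null cache means the datasource skipped caching. */
    static byte[] read(String path, SourceReader source, CachingSupport cache)
            throws IOException {
        if (cache == null) {
            return source.read(path); // passthrough mode
        }
        InputStream in = cache.getInputStream(path);
        if (in != null) {
            try (InputStream cached = in) {
                return cached.readAllBytes(); // cache hit
            }
        }
        byte[] data = source.read(path); // cache miss: read the source
        try (OutputStream out = cache.getOutputStream(path)) {
            out.write(data); // then populate the cache
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        int[] reads = {0};
        SourceReader source = p -> {
            reads[0]++;
            return ("data for " + p).getBytes(StandardCharsets.UTF_8);
        };
        CachingSupport cache = new InMemoryCachingSupport();

        read("/logs/part-0", source, cache);
        read("/logs/part-0", source, cache);
        System.out.println(reads[0]); // 1: the second call was served from the cache

        read("/logs/part-0", source, null);
        System.out.println(reads[0]); // 2: passthrough always reads the source
    }
}
```

Keeping the datasource side down to plain stream IO is what lets the cache
policy and the on-disk format change without touching each datasource, which
is the separation the proposal argues for.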
