1. I don't see why cache should be in columnar format. The only purpose of
Dremel columnar format is to accelerate full table scans. That's it.

2. Scanners will be in C for performance reasons. Dremel idea = scan
performance.

On Wed, Sep 19, 2012 at 12:58 AM, moon soo Lee wrote:

> i agree, working version first, and optimization later.
>
> Are there good reason that many input scanners expected in C?
>
> On Tue, Sep 18, 2012 at 12:11 PM, Ted Dunning wrote:
>
> > I also generally agree, but I really think that we need a bit of experience
> > with a simple working version of Drill first.
> >
> > Also, anything like this is going to have to recognize that there are
> > likely to be multiple columnar formats and that some (many) input scanners
> > are going to be coded in C, not just Java.
> >
> > On Mon, Sep 17, 2012 at 7:51 PM, Azuryy Yu wrote:
> >
> > > Thanks!
> > >
> > > Generally agree, but Cache and Data manipulation should be separated. every
> > > query reach cache firstly, if not hit, then call the read data interface,
> > > which cannot be included in the cache module.
> > >
> > > so everybody can replace cache policy and read/write data. then can
> > > configure drill.cache.policy.class and drill.read.class drill.write.class
> > > in the configure file.
> > >
> > > On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee wrote:
> > >
> > > > Here's my quick drill's common caching framework proposal.
> > > >
> > > > 0. Why
> > > >
> > > > - While In-place processing, data format is not guaranteed the best
> > > >   efficient format to process (ie. columnar).
> > > > - Non-columnar format can make huge performance impact. (order of
> > > >   magnitude)
> > > >
> > > > 1. Goal.
> > > >
> > > > - Increase performance without painful ETL
> > > > - Performance includes not only overall throughput but also how
> > > >   interactive it is.
> > > > - Provide easy implementation interface to datasource point of view
> > > >
> > > > 2. How it looks?
> > > >
> > > > - Drill provide common caching policy. Which is responsible for
> > > >   - construct columnar format
> > > >   - read columnar format
> > > >   - caching algorithm
> > > > - Each datasource optionally implements some method to support caching,
> > > >   they could be
> > > >
> > > >   interface CachingSupport {
> > > >     // to write columnar format data to cache media
> > > >     OutputStream getOutputStream(path);
> > > >
> > > >     // to clear cached data
> > > >     void remove(path);
> > > >
> > > >     // to read cached data
> > > >     InputStream getInputStream(path);
> > > >
> > > >     // to get location information of data (in DFS)
> > > >     Location getLocation(path);
> > > >   }
> > > >
> > > > - The datasource implementation does not care about columnar format,
> > > >   cache replacement policy, things. only care about basic IO. So people who
> > > >   implement datasource does not need to understand columnar things.
> > > >
> > > > 3. How it works?
> > > >
> > > > - Drill construct columnar format cache using datasource provided method.
> > > > - Datasource can skip the implementation for the caching. This time,
> > > >   drill work passthru mode.
> > > > - Cache policy class can be replaced. So if there's more efficient data
> > > >   format, efficient algorithm it can be applied, without changing all
> > > >   datasource implementation.
> > > > - Cache construction does not block data read. So performance impact
> > > >   from cache construction is minimized.
> > > > - Drill performs it's query through cache. There could be some query for
> > > >   cache management (like purge).
> > > >
> > > > Is it worth? or just adding a complexity?
> > > >
> > > > for me, worth +1.
> > > >
> > > > and i'm fully ready to do this job. :-)
> > > >
> > > > Thanks.
> > > >
> > > > ----
> > > > Leemoonsoo
> > > > moon@nflabs.com
> > > >
> > > > On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran wrote:
> > > >
> > > > > The plan was to have the scan operator do that kind of caching, but I agree
> > > > > it could make sense to have some common caching framework in case other
> > > > > scan operators want to cache as well.
> > > > >
> > > > > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee wrote:
> > > > >
> > > > > > Drill want In-place processing ([1], page 12). yes, ETL is painful.
> > > > > > In my understanding, In-place processing means the data is not always
> > > > > > columnar.
> > > > > >
> > > > > > [2], Figure 10, shows performance difference between columnar and
> > > > > > record-oriented (MR)
> > > > > > if Dremel work with record-oriented data, I can guess that'll be order of
> > > > > > magnitude slower.
> > > > > >
> > > > > > If it's true, will this still interactive?
> > > > > >
> > > > > > And can anyone give an more detail about "Adaptively convert storage layout
> > > > > > into more efficient forms", [1], page 12 ?
> > > > > > Is it kind of transparent columnar format caching?
> > > > > >
> > > > > > And if non-columnar data expected in many cases,
> > > > > > then how about drill have common cache for storage interface instead of
> > > > > > each scanner implements their own caching policies?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > [1] Apache Drill, Architecture outlines.
> > > > > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > > > > [2] Dremel: Interactive Analysis of Web-Scale Datasets
> > > > > > http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf
> > > > >
> > > > > --
> > > > > Tomer Shiran
> > > > > Director of Product Management | MapR Technologies | 650-804-8657
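
For reference, the cache-first read path discussed in this thread (try the cache, fall back to the datasource on a miss, run in passthru mode when the datasource offers no caching support) could look roughly like the sketch below. This is only an illustration, not Drill code. The String path type, the null-on-miss convention, and the names InMemoryCachingSupport, SourceReader and CachedReader are all assumptions made for the example. getLocation is left out because no Location type is defined in the thread. The sketch stores raw bytes and fills the cache synchronously, so it takes no position on the columnar question and does not show the non-blocking cache construction from the proposal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Typed variant of the proposed interface. String paths are an assumption.
interface CachingSupport {
    OutputStream getOutputStream(String path) throws IOException;
    void remove(String path) throws IOException;
    // Assumption: returns null when nothing is cached for this path.
    InputStream getInputStream(String path) throws IOException;
}

// Hypothetical in-memory cache medium, only here to make the sketch runnable.
final class InMemoryCachingSupport implements CachingSupport {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    @Override
    public OutputStream getOutputStream(String path) {
        // The bytes become visible to readers only once the writer closes the stream.
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                store.put(path, toByteArray());
            }
        };
    }

    @Override
    public void remove(String path) {
        store.remove(path);
    }

    @Override
    public InputStream getInputStream(String path) {
        byte[] cached = store.get(path);
        return cached == null ? null : new ByteArrayInputStream(cached);
    }
}

// Hypothetical stand-in for the datasource's own read interface.
interface SourceReader {
    byte[] read(String path) throws IOException;
}

// Cache-first reader. A null cache means passthru mode.
final class CachedReader {
    private final CachingSupport cache;
    private final SourceReader source;

    CachedReader(CachingSupport cache, SourceReader source) {
        this.cache = cache;
        this.source = source;
    }

    byte[] read(String path) throws IOException {
        if (cache != null) {
            // try-with-resources skips close() when the resource is null.
            try (InputStream in = cache.getInputStream(path)) {
                if (in != null) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return buf.toByteArray(); // cache hit
                }
            }
        }
        byte[] data = source.read(path); // cache miss or passthru mode
        if (cache != null) {
            try (OutputStream out = cache.getOutputStream(path)) {
                out.write(data);
            }
        }
        return data;
    }
}

class CachingSketch {
    public static void main(String[] args) throws IOException {
        int[] sourceReads = {0};
        SourceReader source = path -> {
            sourceReads[0]++;
            return ("rows of " + path).getBytes(StandardCharsets.UTF_8);
        };
        CachedReader reader = new CachedReader(new InMemoryCachingSupport(), source);
        reader.read("/logs/a");
        reader.read("/logs/a");
        System.out.println("source reads with cache: " + sourceReads[0]);
    }
}
```

Swapping InMemoryCachingSupport for another CachingSupport implementation changes the cache medium without touching CachedReader, which is the separation of cache policy and data access that Azuryy asks for above.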