drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Drill reading links
Date Sun, 26 Aug 2012 03:33:41 GMT
Two other interesting references are the BigQuery documentation [1] and
the Google visualization data source API [2].  The BigQuery documentation
makes clear how very limited the Dremel engine actually is, and how useful
that can be in restricting the complexity of query execution.  The data
source API provides a very interesting model of how a data source can
specify the degree of query pushdown that is available.
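
A minimal sketch of the pushdown idea, in Python.  The names here
(PIPELINE, split_query) are hypothetical and are not taken from the data
source API.  The sketch assumes only that a source declares which query
clauses it can evaluate itself, and that the engine evaluates whatever
is left over.

```python
# Hypothetical clause names, listed in the order they are evaluated.
PIPELINE = ("where", "group_by", "order_by", "limit")


def split_query(query, supported):
    """Split a query into the clauses pushed down to the data source and
    the clauses the engine must still evaluate.

    query     -- dict mapping clause name to its argument
    supported -- set of clause names the source says it can evaluate
    """
    pushed, remaining = {}, {}
    blocked = False
    for clause in PIPELINE:
        if clause not in query:
            continue
        # Once one clause has to stay in the engine, every later clause
        # stays there too, because it depends on the earlier result.
        if not blocked and clause in supported:
            pushed[clause] = query[clause]
        else:
            blocked = True
            remaining[clause] = query[clause]
    return pushed, remaining


if __name__ == "__main__":
    q = {"where": "x > 1", "order_by": "x", "limit": 10}
    print(split_query(q, {"where", "limit"}))
```

In this sketch a source that supports filtering and limits but not
sorting receives only the filter; the limit stays in the engine because
it has to be applied after the sort.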

Where this takes us, I think, is that there are long-term and short-term
possibilities here that can mesh quite nicely.  In the long term, it would
be nice to have all the potential that Dryad and an unrestricted model
might provide.  In the short term, it would be nice to have even a
sequential version of a limited Dremel clone.

[1] https://developers.google.com/bigquery/docs/query-reference

[2] https://developers.google.com/chart/interactive/docs/dev/dsl_about

On Sat, Aug 25, 2012 at 9:49 PM, Jason Frantz <jfrantz@maprtech.com> wrote:

> Hi Camuel,
>
> It looks like you guys have built some pretty cool stuff with ZeroVM. I
> think that could be a promising direction for guaranteeing safety of a
> user-defined function. One of the benefits of a query language like
> BigQuery's is that it gives users enough tools out of the box that many or
> most analytical questions can be answered without custom code. But in those
> cases where custom code is needed, safety is a must.
>
> As far as execution goes, Dryad's DAG model seems to be the correct level
> of abstraction for handling SQL-/MongoDB-style queries. While not generally
> a great fit for recursive/iterative data flows, DAGs give enough structure
> that operators don't have to reimplement scheduling, fault tolerance,
> messaging, etc. There are a number of projects out there exploring
> different models (e.g. there are two Apache projects, Giraph and Hama,
> implementing BSP), as well as a bit of research on expanding Dryad-style
> execution to handle iteration. It doesn't seem very feasible to map the
> full space of computation back into SQL, though.
>
> -Jason
>
> On Sat, Aug 25, 2012 at 5:11 PM, Camuel Gilyadov <camuel@gmail.com> wrote:
>
> > Hi everyone,
> >
> > We have put together a design proposal for Apache Drill based on our
> > two-year experience with OpenDremel.
> >
> > The proposal:
> > http://www.slideshare.net/CamuelGilyadov/apache-drill-14071739.
> > It is very high-level and definitely needs more elaboration; I suggest
> > using this mailing list for that.
> >
> > Regarding the overall architecture described below, it seems consistent
> > with our proposed design. The storage-format component is part of the
> > query compiler, or more accurately a code-template library which the
> > query compiler uses to generate the query plan. I think restricting it
> > to a DAG doesn't bring any extra benefit. It is hard to think of a query
> > plan which is not a DAG, but why restrict the backend without a clear
> > benefit? The main point behind the proposed design is to make nodes
> > completely generic, capable of executing any arbitrary code, and it
> > seems in line with the mentioned Dryad architecture.
> >
> > Please feel free to ask us any question regarding the proposed design or
> > OpenDremel/Dazo in general.
> >
> > As time allows we would elaborate specific topics in our blog:
> > http://bigdatacraft.com with crosslink here.
> >
> > Looking forward to a design document...
> >
> > On Sun, Aug 26, 2012 at 2:36 AM, Jason Frantz <jfrantz@maprtech.com> wrote:
> >
> > > Hi everyone,
> > >
> > > Before sending out an architecture doc, I wanted to send out a set of
> > > links to systems or research that have been influencing our design.
> > > Google's Dremel paper [1] does a good job at summarizing the use case of
> > > fast analytics, but is quite short on the actual system structure. In
> > > addition, we'd like to support some data models and execution patterns
> > > outside of what's mentioned in that paper.
> > >
> > > The overall picture can be very roughly broken down into three
> > > overlapping components. The first is the query language and data model
> > > exposed to the user. Our inspirations here are:
> > > - SQL
> > > - BigQuery [2], which has a SQL-like language wrapped around a protocol
> > > buffer data model [3]
> > > - MongoDB, which has a JSON-derived data model
> > >
> > > The second component is the execution engine. The basic model is that
> > > each query is a data flow program structured as a DAG of execution
> > > nodes, as expressed in Microsoft's Dryad paper [4]. Each node in the DAG
> > > is an operator that may be run across many machines. For examples of
> > > operators, see SQL Server [5].
> > >
> > > The third component is the storage format. There are several distinct
> > > types of formats we want to support:
> > > - Row-based w/o schema, e.g. JSON, CSV
> > > - Row-based w/ schema, e.g. traditional SQL, protobufs
> > > - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
> > >
> > > Rather than relying on the user carefully creating a series of prebuilt
> > > indexes for anything they want to query, we'd like to rely on in-situ
> > > processing whenever possible. This includes adaptive indexing
> > > techniques like "database cracking" [7] as well as the ability to
> > > efficiently process "raw data" [8]. In addition, since we want to
> > > support several distinct data formats we need to transfer between those
> > > formats. One example is varying between JSON, which doesn't have a
> > > consistent "schema" from one row to the next, and protobufs, which do.
> > > Another example is the conversion from columnar format to row format
> > > [9].
> > >
> > > Please feel free to chime in with other references that the project
> > > should be looking into.
> > >
> > > -Jason
> > >
> > > [1] http://research.google.com/pubs/pub36632.html
> > > [2] https://developers.google.com/bigquery/docs/query-reference
> > > [3] https://developers.google.com/protocol-buffers/docs/proto
> > > [4] http://research.microsoft.com/en-us/projects/dryad/
> > > [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
> > > [6] http://db.csail.mit.edu/projects/cstore/
> > > [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
> > > [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
> > > [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
> > >
> >
>
