drill-dev mailing list archives

From Rob Grzywinski <...@aggregateknowledge.com>
Subject Re: Drill reading links
Date Sun, 26 Aug 2012 21:43:40 GMT
I would like to contribute some of the work that we did on field-striping to
the reading list:



It is a work in progress, but it should help fill in many of the gaps
in the Dremel paper. Unfortunately, our paper stops at queries. We
realized quite quickly the inherent limitations of the "trickle
reassembly" (as we've called it) outlined in the Dremel paper (i.e.
that only sub-tree pruning is supported) and started looking for
better query models. (We kept coming back to FLWOR expressions but
couldn't decide whether that was the best approach. E.g. AQL "felt
better" than MongoDB's approach.)
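As a rough illustration of the field-striping idea itself: each leaf field of a nested record is stored as its own column, with a definition level recording how many of its optional ancestors are present, so missing fields can be reconstructed later. This is a minimal sketch of that encoding, not our actual implementation; repetition levels (which handle repeated fields) are omitted for brevity.

```python
def stripe(records, path):
    """Stripe one optional leaf field (e.g. 'name.url') into a
    column of (definition_level, value) pairs."""
    column = []
    parts = path.split(".")
    for record in records:
        node, level = record, 0
        for part in parts:
            if isinstance(node, dict) and part in node:
                node = node[part]
                level += 1
            else:
                break
        # The value is defined only if the full path was present.
        column.append((level, node if level == len(parts) else None))
    return column

records = [
    {"name": {"url": "http://A"}},
    {"name": {}},  # url missing -> definition level 1
    {},            # name missing -> definition level 0
]
print(stripe(records, "name.url"))
# [(2, 'http://A'), (1, None), (0, None)]
```

Because the definition level distinguishes "name was present but url was not" from "name was absent entirely", the original record structure can be rebuilt from the column alone.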

Rob Grzywinski

On 8/25/12 6:36 PM, "Jason Frantz" <jfrantz@maprtech.com> wrote:

> Hi everyone,
> Before sending out an architecture doc, I wanted to send out a set of links
> to systems or research that have been influencing our design. Google's
> Dremel paper [1] does a good job of summarizing the use case of fast
> analytics, but is quite short on the actual system structure. In addition,
> we'd like to support some data models and execution patterns outside of
> what's mentioned in that paper.
> The overall picture can be very roughly broken down into three overlapping
> components. The first is the query language and data model exposed to the
> user. Our inspirations here are
> - SQL
> - BigQuery [2], which has a SQL-like language wrapped around a protocol
> buffer data model [3]
> - MongoDB, which has a JSON-derived data model
> The second component is the execution engine. The basic model is that each
> query is a data flow program structured as a DAG of execution nodes, as
> expressed in Microsoft's Dryad paper [4]. Each node in the DAG is an
> operator that may be run across many machines. For examples of operators,
> see SQL Server [5].
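The DAG-of-operators execution model described above can be sketched very roughly as follows. The operator names (Scan, Filter, Aggregate) are illustrative only, not taken from Drill, Dryad, or SQL Server; each node pulls rows from its inputs, and in a real system each node would be parallelized across machines.

```python
# Toy dataflow DAG: each operator is a node that pulls from its child.

class Scan:
    """Leaf node: produces rows from a source."""
    def __init__(self, rows):
        self.rows = rows
    def run(self):
        yield from self.rows

class Filter:
    """Interior node: passes through rows matching a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def run(self):
        for row in self.child.run():
            if self.predicate(row):
                yield row

class Aggregate:
    """Root node: reduces its input to a single row."""
    def __init__(self, child):
        self.child = child
    def run(self):
        yield sum(self.child.run())

# Roughly "SELECT SUM(x) WHERE x > 2", expressed as a three-node DAG:
plan = Aggregate(Filter(Scan([1, 2, 3, 4]), lambda x: x > 2))
print(list(plan.run()))  # [7]
```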
> The third component is the storage format. There are several distinct types
> of formats we want to support:
> - Row-based w/o schema, e.g. JSON, CSV
> - Row-based w/ schema, e.g. traditional SQL, protobufs
> - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
> Rather than relying on the user carefully creating a series of prebuilt
> indexes for anything they want to query, we'd like to rely on in-situ
> processing whenever possible. This includes adaptive indexing techniques
> like "database cracking" [7] as well as the ability to efficiently process
> "raw data" [8]. In addition, since we want to support several distinct data
> formats, we need to translate between them. One example is varying
> between JSON, which doesn't have a consistent "schema" from one row to the
> next, and protobufs, which do. Another example is the conversion from
> columnar format to row format [9].
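To make the "database cracking" [7] idea above concrete, here is a deliberately simplified sketch: the first query that selects values below a pivot partitions the column in place, so the copy becomes incrementally more sorted as queries arrive. A real cracker index tracks many such crack points; this toy version keeps only one.

```python
def crack_lt(column, pivot):
    """Partition `column` in place around `pivot`; return the index
    where values >= pivot begin (a Hoare-style partition)."""
    lo, hi = 0, len(column) - 1
    while lo <= hi:
        if column[lo] < pivot:
            lo += 1
        else:
            # Move a too-large value to the upper partition.
            column[lo], column[hi] = column[hi], column[lo]
            hi -= 1
    return lo

col = [13, 4, 55, 9, 2, 42]
split = crack_lt(col, 10)
print(sorted(col[:split]))  # values < 10: [2, 4, 9]
```

After this first query, any later range query with a bound of 10 can skip straight to the relevant partition, which is the "index as a by-product of querying" effect the cracking paper describes.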
> Please feel free to chime in with other references that the project should
> be looking into.
> -Jason
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/query-reference
> [3] https://developers.google.com/protocol-buffers/docs/proto
> [4] http://research.microsoft.com/en-us/projects/dryad/
> [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
> [6] http://db.csail.mit.edu/projects/cstore/
> [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
> [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
> [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
