drill-dev mailing list archives

From karthik tunga <karthik.tu...@gmail.com>
Subject Re: Drill reading links
Date Sun, 26 Aug 2012 00:08:38 GMT
Does this mean that data is stored in both row and column format, or is
the data stored in one format based on some configuration?

Cheers,
Karthik

On 25 August 2012 19:36, Jason Frantz <jfrantz@maprtech.com> wrote:

> Hi everyone,
>
> Before sending out an architecture doc, I wanted to send out a set of links
> to systems or research that have been influencing our design. Google's
> Dremel paper [1] does a good job of summarizing the use case of fast
> analytics, but is quite short on the actual system structure. In addition,
> we'd like to support some data models and execution patterns outside of
> what's mentioned in that paper.
>
> The overall picture can be very roughly broken down into three overlapping
> components. The first is the query language and data model exposed to the
> user. Our inspirations here are (a sketch follows the list):
> - SQL
> - BigQuery [2], which has a SQL-like language wrapped around a protocol
> buffer data model [3]
> - MongoDB, which has a JSON-derived data model
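>
> To make the contrast concrete, here is a rough Java sketch (illustrative
> names only, not any real API from the systems above) of a schemaless,
> JSON/MongoDB-style record next to a fixed-schema, protobuf-style one:
>
>   import java.util.List;
>   import java.util.Map;
>
>   public class DataModelSketch {
>
>       // Protobuf-style: the schema is fixed up front, every record has
>       // the same typed fields, and repetition/nesting is declared.
>       static final class Employee {
>           final String name;
>           final long id;
>           final List<String> phoneNumbers; // a repeated field
>
>           Employee(String name, long id, List<String> phoneNumbers) {
>               this.name = name;
>               this.id = id;
>               this.phoneNumbers = phoneNumbers;
>           }
>       }
>
>       public static void main(String[] args) {
>           // JSON/MongoDB-style: no schema; each record is a nested map,
>           // and two records may carry completely different fields.
>           Map<String, Object> doc1 = Map.of("name", "Ada", "id", 1L);
>           Map<String, Object> doc2 = Map.of("name", "Bob",
>                   "phoneNumbers", List.of("555-0100", "555-0101"));
>
>           Employee typed = new Employee("Ada", 1L, List.of("555-0100"));
>           System.out.println(doc1 + " " + doc2 + " " + typed.name);
>       }
>   }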
>
> The second component is the execution engine. The basic model is that each
> query is a data flow program structured as a DAG of execution nodes, as
> expressed in Microsoft's Dryad paper [4]. Each node in the DAG is an
> operator that may be run across many machines. For examples of operators,
> see SQL Server [5].
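>
> To illustrate (this is only a hand-drawn sketch, not Drill or Dryad code),
> a plan in this model is a small graph of operators, each of which pulls
> rows from its inputs and can itself be run in parallel across machines:
>
>   import java.util.Iterator;
>   import java.util.List;
>
>   public class DagSketch {
>
>       // Every node exposes the same interface: an iterator of rows.
>       interface Operator extends Iterator<Object[]> {}
>
>       // Leaf node: scans an (in-memory, for this sketch) table.
>       static final class Scan implements Operator {
>           private final Iterator<Object[]> rows;
>           Scan(List<Object[]> table) { this.rows = table.iterator(); }
>           public boolean hasNext() { return rows.hasNext(); }
>           public Object[] next() { return rows.next(); }
>       }
>
>       // Interior node: keeps only rows whose second column is > 100.
>       static final class Filter implements Operator {
>           private final Operator child;
>           private Object[] lookahead;
>           Filter(Operator child) { this.child = child; advance(); }
>           private void advance() {
>               lookahead = null;
>               while (child.hasNext()) {
>                   Object[] row = child.next();
>                   if ((long) row[1] > 100L) { lookahead = row; return; }
>               }
>           }
>           public boolean hasNext() { return lookahead != null; }
>           public Object[] next() {
>               Object[] r = lookahead;
>               advance();
>               return r;
>           }
>       }
>
>       public static void main(String[] args) {
>           // scan -> filter is a trivial two-node plan; a real one would
>           // also have joins, aggregations, and exchanges between machines.
>           Operator plan = new Filter(new Scan(List.of(
>                   new Object[]{"a", 50L}, new Object[]{"b", 200L})));
>           while (plan.hasNext()) {
>               System.out.println(plan.next()[0]); // prints "b"
>           }
>       }
>   }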
>
> The third component is the storage format. There are several distinct types
> of formats we want to support (a small layout sketch follows the list):
> - Row-based w/o schema, e.g. JSON, CSV
> - Row-based w/ schema, e.g. traditional SQL, protobufs
> - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
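>
> Roughly, the difference between the last two of these looks like the
> following (again just an illustration, not how any of those systems
> actually lay out bytes):
>
>   import java.util.Arrays;
>   import java.util.List;
>
>   public class LayoutSketch {
>       public static void main(String[] args) {
>           // Row-based w/ schema: each record's fields sit together,
>           // which is good for reading whole records.
>           List<Object[]> rowStore = List.of(
>                   new Object[]{"a", 10L, 1.5},
>                   new Object[]{"b", 20L, 2.5},
>                   new Object[]{"c", 30L, 3.5});
>
>           // Columnar w/ schema: each column's values sit together,
>           // which is good for scanning a few columns out of many.
>           String[] names  = {"a", "b", "c"};
>           long[]   counts = {10L, 20L, 30L};
>           double[] scores = {1.5, 2.5, 3.5};
>
>           // Summing one column touches only that column in the columnar
>           // layout, but every record in the row layout.
>           long sumColumnar = Arrays.stream(counts).sum();
>           long sumRows = rowStore.stream()
>                   .mapToLong(r -> (long) r[1]).sum();
>           System.out.println(sumColumnar + " == " + sumRows); // 60 == 60
>       }
>   }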
>
> Rather than relying on the user to carefully create a series of prebuilt
> indexes for anything they want to query, we'd like to rely on in-situ
> processing whenever possible. This includes adaptive indexing techniques
> like "database cracking" [7] as well as the ability to efficiently process
> "raw data" [8]. In addition, since we want to support several distinct data
> formats, we need to translate between them. One example is converting
> between JSON, which doesn't have a consistent "schema" from one row to the
> next, and protobufs, which do. Another example is the conversion from
> columnar format to row format [9].
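>
> On that last point, a toy version of the columnar-to-row conversion
> (stitching per-column arrays back into row tuples) could look like the
> sketch below; when and where to do this materialization is exactly the
> kind of trade-off studied in [9]:
>
>   import java.util.ArrayList;
>   import java.util.Arrays;
>   import java.util.List;
>
>   public class ColumnToRowSketch {
>
>       // Rebuild row tuples from per-column arrays. Column stores try to
>       // delay this step so filters and aggregates can first run on the
>       // compact per-column arrays.
>       static List<Object[]> materialize(String[] names, long[] counts) {
>           List<Object[]> rows = new ArrayList<>();
>           for (int i = 0; i < names.length; i++) {
>               rows.add(new Object[]{names[i], counts[i]});
>           }
>           return rows;
>       }
>
>       public static void main(String[] args) {
>           String[] names  = {"a", "b", "c"};
>           long[]   counts = {10L, 20L, 30L};
>           for (Object[] row : materialize(names, counts)) {
>               System.out.println(Arrays.toString(row)); // e.g. [a, 10]
>           }
>       }
>   }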
>
> Please feel free to chime in with other references that the project should
> be looking into.
>
> -Jason
>
> [1] http://research.google.com/pubs/pub36632.html
> [2] https://developers.google.com/bigquery/docs/query-reference
> [3] https://developers.google.com/protocol-buffers/docs/proto
> [4] http://research.microsoft.com/en-us/projects/dryad/
> [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
> [6] http://db.csail.mit.edu/projects/cstore/
> [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
> [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
> [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
>
