drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Frantz <jfra...@maprtech.com>
Subject Re: Drill reading links
Date Sun, 26 Aug 2012 01:29:48 GMT
Yeah, we want the ability to query data stored in multiple formats. For
example, if you have a bunch of log files in HDFS, you should be able to
query them in place without having to run a load tool. The same principal
applies if you want to query a table in Cassandra - a query should be able
to process that data without a separate load step.

-Jason

On Sat, Aug 25, 2012 at 5:08 PM, karthik tunga <karthik.tunga@gmail.com>wrote:

> Does this mean that data is stored in both row and column format ? or is
> the data stored in one format based on some configuration ?
>
> Cheers,
> Karthik
>
> On 25 August 2012 19:36, Jason Frantz <jfrantz@maprtech.com> wrote:
>
> > Hi everyone,
> >
> > Before sending out an architecture doc, I wanted to send out a set of
> links
> > to systems or research that have been influencing our design. Google's
> > Dremel paper [1] does a good job at summarizing the use case of fast
> > analytics, but is quite short on the actual system structure. In
> addition,
> > we'd like to support some data models and execution patterns outside of
> > what's mentioned in that paper.
> >
> > The overall picture can be very roughly broken down into three
> overlapping
> > components. The first is the query language and data model exposed to the
> > user. Our inspirations here are
> > - SQL
> > - BigQuery [2], which has a SQL-like language wrapped around a protocol
> > buffer data model [3]
> > - MongoDB, which has a JSON-derived data model
> >
> > The second component is the execution engine. The basic model is that
> each
> > query is a data flow program structured as a DAG of execution nodes, as
> > expressed in Microsoft's Dryad paper [4]. Each node in the DAG is an
> > operator that may be run across many machines. For examples of operators,
> > see SQL Server [5].
> >
> > The third component is the storage format. There are several distinct
> types
> > of formats we want to support:
> > - Row-based w/o schema, e.g. JSON, CSV
> > - Row-based w/ schema, e.g. traditional SQL, protobufs
> > - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
> >
> > Rather than relying on the user carefully creating a series of prebuilt
> > indexes for anything they want to query, we'd like to rely on in-situ
> > processing whenever possible. This includes adaptive indexing techniques
> > like "database cracking" [7] as well as the ability to efficiently
> process
> > "raw data" [8]. In addition, since we want to support several distinct
> data
> > formats we need to transfer between those formats. One example is varying
> > between JSON, which doesn't have a consistent "schema" from one row to
> the
> > next, and protobufs, which do. Another example is the conversion from
> > columnar format to row format [9].
> >
> > Please feel free to chime in with other references that the project
> should
> > be looking into.
> >
> > -Jason
> >
> > [1] http://research.google.com/pubs/pub36632.html
> > [2] https://developers.google.com/bigquery/docs/query-reference
> > [3] https://developers.google.com/protocol-buffers/docs/proto
> > [4] http://research.microsoft.com/en-us/projects/dryad/
> > [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
> > [6] http://db.csail.mit.edu/projects/cstore/
> > [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
> > [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
> > [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message