drill-dev mailing list archives

From Michael Hausenblas <michael.hausenb...@gmail.com>
Subject Re: git commit: First commit
Date Tue, 04 Sep 2012 08:19:56 GMT
> use https://git-wip-us.apache.org/repos/asf/incubator-drill.git

Thanks, that worked.

Cheers,
	   Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 3 Sep 2012, at 22:49, Jim Donofrio wrote:

> use https://git-wip-us.apache.org/repos/asf/incubator-drill.git
> 
> On 09/03/2012 05:22 PM, Michael Hausenblas wrote:
>> Ted,
>> 
>>> First commit
>> Cool ;)
>> 
>> Tried to clone and got:
>> 
>>  git clone git://git-wip-us.apache.org/repos/asf?p=incubator-drill.git repo
>>  Cloning into repo...
>>  git-wip-us.apache.org[0: 140.211.11.121]: errno=Operation timed out
>>  fatal: unable to connect a socket (Operation timed out)
>> 
>> Also, it does not seem to be listed on http://git.apache.org/ yet - could that be the reason I am not able to clone it?
>> 
>> Cheers,
>> 	   Michael
>> 
>> --
>> Michael Hausenblas
>> Ireland, Europe
>> http://mhausenblas.info/
>> 
>> On 3 Sep 2012, at 22:09, tdunning@apache.org wrote:
>> 
>>> Updated Branches:
>>>  refs/heads/master [created] 9229caa45
>>> 
>>> 
>>> First commit
>>> 
>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-drill/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-drill/commit/9229caa4
>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-drill/tree/9229caa4
>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-drill/diff/9229caa4
>>> 
>>> Branch: refs/heads/master
>>> Commit: 9229caa45a32dc06625f2443b6a5d84ab0a4df10
>>> Parents:
>>> Author: Ted Dunning <ted.dunning@gmail.com>
>>> Authored: Mon Sep 3 13:21:32 2012 -0700
>>> Committer: Ted Dunning <ted.dunning@gmail.com>
>>> Committed: Mon Sep 3 13:21:32 2012 -0700
>>> 
>>> ----------------------------------------------------------------------
>>> README.md |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 files changed, 127 insertions(+), 0 deletions(-)
>>> ----------------------------------------------------------------------
>>> 
>>> 
>>> http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/9229caa4/README.md
>>> ----------------------------------------------------------------------
>>> diff --git a/README.md b/README.md
>>> new file mode 100644
>>> index 0000000..51772a9
>>> --- /dev/null
>>> +++ b/README.md
>>> @@ -0,0 +1,127 @@
>>> += Drill =
>>> +
>>> +This is a copy of the original proposal for Drill, for now.  Please edit and
update as appropriate.
>>> +
>>> +== Abstract ==
>>> +Drill is a distributed system for interactive analysis of large-scale datasets,
inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>> +
>>> +== Proposal ==
>>> +Drill is a distributed system for interactive analysis of large-scale datasets.
Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader
range of query languages, data formats and data sources. It is designed to efficiently process
nested data. It is a design goal to scale to 10,000 servers or more and to be able to process
petabytes of data and trillions of records in seconds.
>>> +
>>> +== Background ==
>>> +Many organizations have the need to run data-intensive applications, including
batch processing, stream processing and interactive analysis. In recent years open source
systems have emerged to address the need for scalable batch processing (Apache Hadoop) and
stream processing (Storm, Apache S4). In 2010 Google published a paper called "Dremel: Interactive
Analysis of Web-Scale Datasets," describing a scalable system used internally for interactive
analysis of nested data. No open source project has successfully replicated the capabilities
of Dremel.
>>> +
>>> +== Rationale ==
>>> +There is a strong need in the market for low-latency interactive analysis of
large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need
was identified by Google and addressed internally with a system called Dremel.
>>> +
>>> +In recent years open source systems have emerged to address the need for scalable
batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop,
originally inspired by Google's internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput,
but is not designed to achieve the sub-second latency needed for interactive data analysis
and exploration. Drill, inspired by Google's internal Dremel system, is intended to address
this need.
>>> +
>>> +It is worth noting that, as explained by Google in the original paper, Dremel
complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce
and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly
prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of
Google employees.
>>> +
>>> +Like Dremel, Drill supports a nested data model with data encoded in a number
of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the
standard, so supporting a nested data model eliminates the need to normalize the data. With
that said, flat data formats, such as CSV files, are naturally supported as a special case
of nested data.
>>> +
>>> +The Drill architecture consists of four key components/layers:
>>> + * Query languages: This layer is responsible for parsing the user's query and
constructing an execution plan.  The initial goal is to support the SQL-like language used
by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]],
which we call DrQL. However, Drill is designed to support other languages and programming
models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]].
>>> + * Low-latency distributed execution engine: This layer is responsible for executing
the physical plan. It provides the scalability and fault tolerance needed to efficiently query
petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed
execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and
can be extended with additional operators and connectors.
>>> + * Nested data formats: This layer is responsible for supporting various data
formats. The initial goal is to support the column-based format used by Dremel. Drill is designed
to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and
CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support
column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such
as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that
the execution engine is flexible enough to support column-based processing as well as row-based
processing. This is important because column-based processing can be much more efficient when
the data is stored in a column-based format, but many large data assets are stored in a row-based
format that would require conversion before use.
>>> + * Scalable data sources: This layer is responsible for supporting various data
sources. The initial focus is to leverage Hadoop as a data source.
>>> +
>>> +It is worth noting that no open source project has successfully replicated the
capabilities of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable
query languages, data formats, data sources and execution engine operators/connectors) that
are part of Drill.
>>> +
>>> +== Initial Goals ==
>>> +The initial goals for this project are to specify the detailed requirements
and architecture, and then develop the initial implementation including the execution engine
and DrQL.
>>> +Like Apache Hadoop, which was built to support multiple storage systems (through
the FileSystem API) and file formats (through the InputFormat/OutputFormat APIs), Drill will
be built to support multiple query languages, data formats and data sources. The initial implementation
of Drill will support DrQL and a column-based format similar to Dremel's.
>>> +
>>> +== Current Status ==
>>> +Significant work has been completed to identify the initial requirements and
define the overall system architecture. The next step is to implement the four components
described in the Rationale section, and we intend to do that development as an Apache project.
>>> +
>>> +=== Meritocracy ===
>>> +We plan to invest in supporting a meritocracy. We will discuss the requirements
in an open forum. Several companies have already expressed interest in this project, and we
intend to invite additional developers to participate. We will encourage and monitor community
participation so that privileges can be extended to those that contribute. Also, Drill has
an extensible/pluggable architecture that encourages developers to contribute various extensions,
such as query languages, data formats, data sources and execution engine operators and connectors.
While some companies will surely develop commercial extensions, we also anticipate that some
companies and individuals will want to contribute such extensions back to the project, and
we look forward to fostering a rich ecosystem of extensions.
>>> +
>>> +=== Community ===
>>> +The need for an open source system for interactive analysis of large datasets
is tremendous, so there is potential for a very large community. We believe that
Drill's extensible architecture will further encourage community participation. Also, related
Apache projects (eg, Hadoop) have very large and active communities, and we expect that over
time Drill will also attract a large community.
>>> +
>>> +=== Core Developers ===
>>> +The developers on the initial committers list include experienced distributed
systems engineers:
>>> + * Tomer Shiran has experience developing distributed execution engines. He
developed Parallel DataSeries, a data-parallel version of the open source [[http://tesla.hpl.hp.com/opensource/|DataSeries]]
system. He is also the author of Applying Idealized Lower-bound Runtime Models to Understand
Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked as a software developer
and researcher at IBM Research, Microsoft and HP Labs, and is now at MapR Technologies. He
has been active in the Hadoop community since 2009.
>>> + * Jason Frantz was at Clustrix, where he designed and developed the first scale-out
SQL database based on MySQL. Jason developed the distributed query optimizer that powered
Clustrix. He is now a software engineer and architect at MapR Technologies.
>>> + * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and has
a history of over 30 years of contributions to open source. He is now at MapR Technologies.
Ted has been very active in the Hadoop community since the project's early days.
>>> + * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
he worked on Google's scalable search infrastructure. MC Srivas has been active in the Hadoop
community since 2009.
>>> + * Chris Wensel is the founder and CEO of Concurrent. Prior to founding Concurrent,
he developed Cascading, an Apache-licensed open source application framework enabling Java
developers to quickly and easily develop robust Data Analytics and Data Management applications
on Apache Hadoop. Chris has been involved in the Hadoop community since the project's early
days.
>>> + * Keys Botzum was at IBM, where he worked on security and distributed systems,
and is currently at MapR Technologies.
>>> + * Gera Shegalov was at Oracle, where he worked on networking, storage and database
kernels, and is currently at MapR Technologies.
>>> + * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed Spire,
a real-time operational database for Hadoop. He is also a committer and PMC member for Apache
HBase, and has a long history of contributions to open source. Ryan has been involved in the
Hadoop community since the project's early days.
>>> +
>>> +We realize that additional employer diversity is needed, and we will work aggressively
to recruit developers from additional companies.
>>> +
>>> +=== Alignment ===
>>> +The initial committers strongly believe that a system for interactive analysis
of large-scale datasets will gain broader adoption as an open source, community driven project,
where the community can contribute not only to the core components, but also to a growing
collection of query languages and optimizers, data formats, data sources, and execution engine
operators and connectors. Drill will integrate closely with Apache Hadoop. First, the data
will live in Hadoop. That is, Drill will support Hadoop FileSystem implementations and HBase.
Second, Hadoop-related data formats will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based
tools will be provided to produce column-based formats. Fourth, Drill tables can be registered
in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
>>> +
>>> +== Known Risks ==
>>> +
>>> +=== Orphaned Products ===
>>> +The contributors are leading vendors in this space, with significant open source
experience, so the risk of being orphaned is relatively low. The project could be at risk
if vendors decided to change their strategies in the market. In such an event, the current
committers plan to continue working on the project on their own time, though the progress
will likely be slower. We plan to mitigate this risk by recruiting additional committers.
>>> +
>>> +=== Inexperience with Open Source ===
>>> +The initial committers include veteran Apache members (committers and PMC members)
and other developers who have varying degrees of experience with open source projects. All
have been involved with source code that has been released under an open source license, and
several also have experience developing code with an open source development process.
>>> +
>>> +=== Homogenous Developers ===
>>> +The initial committers are employed by a number of companies, including MapR
Technologies, Concurrent and Drawn to Scale. We are committed to recruiting additional committers
from other companies.
>>> +
>>> +=== Reliance on Salaried Developers ===
>>> +It is expected that Drill development will occur on both salaried time and on
volunteer time, after hours. The majority of initial committers are paid by their employer
to contribute to this project. However, they are all passionate about the project, and we
are confident that the project will continue even if no salaried developers contribute to
the project. We are committed to recruiting additional committers including non-salaried developers.
>>> +
>>> +=== Relationships with Other Apache Products ===
>>> +As mentioned in the Alignment section, Drill is closely integrated with Hadoop,
Avro, Hive and HBase in numerous ways. For example, Drill data lives inside a Hadoop environment
(Drill operates on in situ data). We look forward to collaborating with those communities,
as well as other Apache communities.
>>> +
>>> +=== An Excessive Fascination with the Apache Brand ===
>>> +Drill solves a real problem that many organizations struggle with, and has been
proven within Google to be of significant value. The architecture is based on academic and
industry research. Our rationale for developing Drill as an Apache project is detailed in
the Rationale section. We believe that the Apache brand and community process will help us
attract more contributors to this project, and help establish ubiquitous APIs. In addition,
establishing consensus among users and developers of a Dremel-like tool is a key requirement
for success of the project.
>>> +
>>> +== Documentation ==
>>> +Drill is inspired by Google's Dremel. Google has published a [[http://research.google.com/pubs/pub36632.html|paper]]
highlighting Dremel's innovative nested column-based data format and execution engine.
>>> +
>>> +== Initial Source ==
>>> +The requirement and design documents are currently stored in MapR Technologies'
source code repository. They will be checked in as part of the initial code dump. Check out
the [[attachment:Drill slides.pdf|attached slides]].
>>> +
>>> +== Cryptography ==
>>> +Drill will eventually support encryption on the wire. This is not one of the
initial goals, and we do not expect Drill to be a controlled export item due to the use of
encryption.
>>> +
>>> +== Required Resources ==
>>> +
>>> +=== Mailing List ===
>>> + * drill-private
>>> + * drill-dev
>>> + * drill-user
>>> +
>>> +=== Subversion Directory ===
>>> +Git is the preferred source control system: git://git.apache.org/drill
>>> +
>>> +=== Issue Tracking ===
>>> +JIRA Drill (DRILL)
>>> +
>>> +== Initial Committers ==
>>> + * Tomer Shiran <tshiran at maprtech dot com>
>>> + * Ted Dunning <tdunning at apache dot org>
>>> + * Jason Frantz <jfrantz at maprtech dot com>
>>> + * MC Srivas <mcsrivas at maprtech dot com>
>>> + * Chris Wensel <chris at concurrentinc dot com>
>>> + * Keys Botzum <kbotzum at maprtech dot com>
>>> + * Gera Shegalov <gshegalov at maprtech dot com>
>>> + * Ryan Rawson <ryan at drawntoscale dot com>
>>> +
>>> +== Affiliations ==
>>> +The initial committers are employees of MapR Technologies, Drawn to Scale and
Concurrent. The nominated mentors are employees of MapR Technologies, Lucid Imagination and
Nokia.
>>> +
>>> +== Sponsors ==
>>> +
>>> +=== Champion ===
>>> +Ted Dunning (tdunning at apache dot org)
>>> +
>>> +=== Nominated Mentors ===
>>> + * Ted Dunning <tdunning at apache dot org> – Chief Application Architect
at MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
>>> + * Grant Ingersoll <grant at lucidimagination dot com> – Chief Scientist
at Lucid Imagination, Committer for Lucene, Mahout and other projects.
>>> + * Isabel Drost <isabel at apache dot org> – Software Developer at Nokia
Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>>> +
>>> +=== Sponsoring Entity ===
>>> +Incubator
>>> +
>>> 
>> 
> 
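
For reference, the clone fix confirmed at the top of this thread can be written out as a small shell sketch. The URL is the one Jim quoted; the target directory name "repo" is carried over from the failed attempt, and clone_cmd is a hypothetical helper introduced here only so the command can be inspected before it is run. The failed attempt combined the git:// protocol with a gitweb-style "?p=" query string, which is probably why it never connected, though the thread does not confirm the cause.

```shell
# Repository URL as quoted in Jim Donofrio's reply in this thread.
REPO_URL="https://git-wip-us.apache.org/repos/asf/incubator-drill.git"
# Target directory; "repo" matches the name used in the failed attempt.
TARGET_DIR="repo"

# clone_cmd is a hypothetical helper: it only prints the command,
# so nothing touches the network until the reader chooses to run it.
clone_cmd() {
  printf 'git clone %s %s\n' "$REPO_URL" "$TARGET_DIR"
}

clone_cmd
# To perform the clone for real (network access required, and the
# 2012-era host may no longer be available):
#   git clone "$REPO_URL" "$TARGET_DIR"
```

Printing the command first keeps the sketch safe to paste; the commented-out line at the end is the actual fix from the thread.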

