Added: drill/site/trunk/content/drill/docs/persistent-configuration-storage/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/persistent-configuration-storage/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/persistent-configuration-storage/index.html (added) +++ drill/site/trunk/content/drill/docs/persistent-configuration-storage/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,175 @@

Persistent Configuration Storage

+ +
+ +

Drill stores persistent configuration data in a persistent configuration store +(PStore). This data is encoded in JSON or Protobuf format. Drill can use the +local file system, ZooKeeper, HBase, or MapR-DB to store this data. The data +stored in a PStore includes state information for storage plugins, query +profiles, and ALTER SYSTEM settings. The default type of PStore configured +depends on the Drill installation mode.

+ +

The following table provides the persistent storage mode for each of the Drill +modes:

+ +
Mode | Description
Embedded | Drill stores persistent data in the local file system. You cannot modify the PStore location for Drill in embedded mode.
Distributed | Drill stores persistent data in ZooKeeper, by default. You can modify where ZooKeeper offloads data, or you can change the persistent storage mode to HBase or MapR-DB.
+ + +

Note: Switching between storage modes does not migrate configuration data.

+ +

ZooKeeper for Persistent Configuration Storage

+ +

To make Drill installation and configuration simple, Drill uses ZooKeeper to +store persistent configuration data. The ZooKeeper PStore provider stores all +of the persistent configuration data in ZooKeeper except for query profile +data.

+ +

The ZooKeeper PStore provider offloads query profile data to the +${DRILL_LOG_DIR:-/var/log/drill} directory on Drill nodes. If you want the +query profile data stored in a specific location, you can configure where +ZooKeeper offloads the data.

+ +

To modify where the ZooKeeper PStore provider offloads query profile data, +configure the sys.store.provider.zk.blobroot property in the drill.exec +block in <drill_installation_directory>/conf/drill-override.conf on each +Drill node and then restart the Drillbit service.

+ +

Example

+
drill.exec: {
+ cluster-id: "my_cluster_com-drillbits",
+ zk.connect: "<zkhostname>:<port>",
+ sys.store.provider.zk.blobroot: "maprfs://<directory to store pstore data>/"
+}
+
+

Issue the following command to restart the Drillbit on all Drill nodes:

+
maprcli node services -name drill-bits -action restart -nodes <node IP addresses separated by a space>
+
+

HBase for Persistent Configuration Storage

+ +

To change the persistent storage mode for Drill, add or modify the sys.store.provider block in <drill_installation_directory>/conf/drill-override.conf.

+ +

Example

+
sys.store.provider: {
+    class: "org.apache.drill.exec.store.hbase.config.HBasePStoreProvider",
+    hbase: {
+      table : "drill_store",
+      config: {
+      "hbase.zookeeper.quorum": "<ip_address>,<ip_address>,<ip_address>,<ip_address>",
+      "hbase.zookeeper.property.clientPort": "2181"
+      }
+    }
+  },
+
+

MapR-DB for Persistent Configuration Storage

+ +

The MapR-DB plugin will be released soon. You can compile Drill from +source to try out this +new feature.

+ +

If you have MapR-DB in your cluster, you can use MapR-DB for persistent +configuration storage. Using MapR-DB to store persistent configuration data +can prevent memory strain on ZooKeeper in clusters running heavy workloads.

+ +

To change the persistent storage mode to MapR-DB, add or modify the sys.store.provider block in <drill_installation_directory>/conf/drill-override.conf on each Drill node and then restart the Drillbit service.

+ +

Example

+
sys.store.provider: {
+class: "org.apache.drill.exec.store.hbase.config.HBasePStoreProvider",
+hbase: {
+  table : "/tables/drill_store",
+    }
+},
+
+

Issue the following command to restart the Drillbit on all Drill nodes:

+
maprcli node services -name drill-bits -action restart -nodes <node IP addresses separated by a space>
+
Added: drill/site/trunk/content/drill/docs/planning-and-execution-options/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/planning-and-execution-options/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/planning-and-execution-options/index.html (added) +++ drill/site/trunk/content/drill/docs/planning-and-execution-options/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,123 @@

Planning and Execution Options

+ +
+ +

You can set Drill query planning and execution options per cluster, at the +system or session level. Options set at the session level only apply to +queries that you run during the current Drill connection. Options set at the +system level affect the entire system and persist between restarts. Session +level settings override system level settings.

+ +

Querying Planning and Execution Options

+ +

You can run the following query to see a list of the system and session +planning and execution options:

+
SELECT name FROM sys.options WHERE type in ('SYSTEM','SESSION');
+
+
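To see the current values as well as the names, you can select additional columns from sys.options. The following is a minimal sketch; it assumes the sys.options table exposes kind, status, and typed value columns such as num_val, string_val, and bool_val, which may vary across Drill versions:

SELECT name, kind, status, num_val, string_val, bool_val
FROM sys.options
WHERE type IN ('SYSTEM','SESSION')
ORDER BY name;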

Configuring Planning and Execution Options

+ +

Use the ALTER SYSTEM or ALTER SESSION commands to set options. Typically, you set the options at the session level unless you want the setting to persist across all sessions.
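For example, the following sketch sets one option for the current session and another for the entire system; the option names come from the table below, and the values shown are only illustrative:

ALTER SESSION SET `store.json.all_text_mode` = true;
ALTER SYSTEM SET `planner.width.max_per_query` = 30;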

+ +

The following table contains planning and execution options that you can set +at the system or session level:

+ +
exec.errors.verbose (default: false)
This option enables or disables the verbose message that Drill returns when a query fails. When enabled, Drill provides additional information about failed queries.

exec.max_hash_table_size (default: 1073741824)
The default maximum size for hash tables.

exec.min_hash_table_size (default: 65536)
The default starting size for hash tables. Increasing this size is useful for very large aggregations or joins when you have large amounts of memory for Drill to use. Drill can spend a lot of time resizing the hash table as it finds new data. If you have large data sets, you can increase this hash table size to increase performance.

planner.add_producer_consumer (default: false)
This option enables or disables a secondary reading thread that works out of band of the rest of the scanning fragment to prefetch data from disk. If you interact with a certain type of storage medium that is slow or does not prefetch much data, this option tells Drill to add a producer-consumer reading thread to the operation. Drill can then assign one thread that focuses on a single reading fragment. If Drill is using memory, you can disable this option to get better performance. If Drill is using disk space, you should enable this option and set a reasonable queue size for the planner.producer_consumer_queue_size option.

planner.broadcast_threshold (default: 1000000)
Threshold, in terms of a number of rows, that determines whether a broadcast join is chosen for a query. Regardless of the setting of the broadcast_join option (enabled or disabled), a broadcast join is not chosen unless the right side of the join is estimated to contain fewer rows than this threshold. The intent of this option is to avoid broadcasting too many rows for join purposes. Broadcasting involves sending data across nodes and is a network-intensive operation. (The "right side" of the join, which may itself be a join or simply a table, is determined by cost-based optimizations and heuristics during physical planning.)

planner.enable_broadcast_join, planner.enable_hashagg, planner.enable_hashjoin, planner.enable_mergejoin, planner.enable_multiphase_agg, planner.enable_streamagg (default: true)
These options enable or disable specific aggregation and join operators for queries. These operators are all enabled by default and in general should not be disabled. Hash aggregation and hash join are hash-based operations. Streaming aggregation and merge join are sort-based operations. Both hash-based and sort-based operations consume memory; however, currently, hash-based operations do not spill to disk as needed, but the sort-based operations do. If large hash operations do not fit in memory on your system, you may need to disable these operations. Queries will continue to run, using alternative plans.

planner.producer_consumer_queue_size (default: 10)
Determines how much data to prefetch from disk (in record batches) out of band of query execution. The larger the queue size, the greater the amount of memory that the queue and overall query execution consumes.

planner.slice_target (default: 100000)
The number of records manipulated within a fragment before Drill parallelizes them.

planner.width.max_per_node (default: depends on the number of cores on each node)
In this context "width" refers to fanout or distribution potential: the ability to run a query in parallel across the cores on a node and the nodes on a cluster.

A physical plan consists of intermediate operations, known as query "fragments," that run concurrently, yielding opportunities for parallelism above and below each exchange operator in the plan. An exchange operator represents a breakpoint in the execution flow where processing can be distributed. For example, a single-process scan of a file may flow into an exchange operator, followed by a multi-process aggregation fragment.

The maximum width per node defines the maximum degree of parallelism for any fragment of a query, but the setting applies at the level of a single node in the cluster.

The default maximum degree of parallelism per node is calculated from the number of cores available on the node, with the theoretical maximum automatically scaled back (and rounded down) so that only 70% of the actual available capacity is taken into account. For example, on a single-node test system with 2 cores and hyper-threading enabled, the 4 logical cores scale back to 4 * 0.7 = 2.8, for a default of 2.

When you modify the default setting, you can supply any meaningful number. The system does not automatically scale down your setting.

planner.width.max_per_query (default: 1000)
The max_per_query value also sets the maximum degree of parallelism for any given stage of a query, but the setting applies to the query as executed by the whole cluster (multiple nodes). In effect, the actual maximum width per query is the minimum of two values: the planner.width.max_per_query setting itself, and planner.width.max_per_node multiplied by the number of nodes in the cluster. For example, on a 4-node cluster where width.max_per_node is set to 6 and width.max_per_query is set to 30, the node-derived limit is 6 * 4 = 24. In this case, the effective maximum width per query is 24, not 30.

store.format
Output format for data that is written to tables with the CREATE TABLE AS (CTAS) command.

store.json.all_text_mode (default: false)
This option enables or disables text mode. When enabled, Drill reads everything in JSON as a text object instead of trying to interpret data types. This allows complicated JSON to be read using CASE and CAST.

store.parquet.block-size (default: 536870912)
Target size for a Parquet row group, which should be equal to or less than the configured HDFS block size.
+
Added: drill/site/trunk/content/drill/docs/ports-used-by-drill/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/ports-used-by-drill/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/ports-used-by-drill/index.html (added) +++ drill/site/trunk/content/drill/docs/ports-used-by-drill/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,95 @@

Ports Used by Drill

+ +
+ +

The following table provides a list of the ports that Drill uses, the port +type, and a description of how Drill uses the port:

+ +
Port | Type | Description
8047 | TCP | Needed for the Drill Web UI.
31010 | TCP | User port address. Used between nodes in a Drill cluster. Needed for an external client, such as Tableau, to connect into the cluster nodes. Also needed for the Drill Web UI.
31011 | TCP | Control port address. Used between nodes in a Drill cluster. Needed for multi-node installation of Apache Drill.
31012 | TCP | Data port address. Used between nodes in a Drill cluster. Needed for multi-node installation of Apache Drill.
46655 | UDP | Used for JGroups and Infinispan. Needed for multi-node installation of Apache Drill.
+
Added: drill/site/trunk/content/drill/docs/progress-reports/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/progress-reports/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/progress-reports/index.html (added) +++ drill/site/trunk/content/drill/docs/progress-reports/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,97 @@

Progress Reports

+ +
+ +

Review the following Apache Drill progress reports for a summary of issues, +progression of the project, summary of mailing list discussions, and events:

+ + +
Added: drill/site/trunk/content/drill/docs/project-bylaws/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/project-bylaws/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/project-bylaws/index.html (added) +++ drill/site/trunk/content/drill/docs/project-bylaws/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,260 @@

Project Bylaws

+ +
+ +

Introduction

+ +

This document defines the bylaws under which the Apache Drill project +operates. It defines the roles and responsibilities of the project, who may +vote, how voting works, how conflicts are resolved, etc.

+ +

Drill is a project of the Apache Software +Foundation. The foundation holds the +copyright on Apache code including the code in the Drill codebase. The +foundation FAQ explains the +operation and background of the foundation.

+ +

Drill is typical of Apache projects in that it operates under a set of +principles, known collectively as the Apache Way. If you are new to Apache +development, please refer to the Incubator +project for more information on how Apache +projects operate.

+ +

Roles and Responsibilities

+ +

Apache projects define a set of roles with associated rights and +responsibilities. These roles govern what tasks an individual may perform +within the project. The roles are defined in the following sections.

+ +

Users

+ +

The most important participants in the project are people who use our +software. The majority of our contributors start out as users and guide their +development efforts from the user's perspective.

+ +

Users contribute to the Apache projects by providing feedback to contributors +in the form of bug reports and feature suggestions. As well, users participate +in the Apache community by helping other users on mailing lists and user +support forums.

+ +

Contributors

+ +

Contributors are all of the volunteers who contribute time, code, documentation, or resources to the Drill Project. A contributor who makes sustained, welcome contributions to the project may be invited to become a committer, though the exact timing of such invitations depends on many factors.

+ +

Committers

+ +

The project's committers are responsible for the project's technical +management. Committers have access to a specified set of subproject's code +repositories. Committers on subprojects may cast binding votes on any +technical discussion regarding that subproject.

+ +

Committer access is by invitation only and must be approved by lazy consensus +of the active PMC members. A Committer is considered emeritus by his or her +own declaration or by not contributing in any form to the project for over six +months. An emeritus committer may request reinstatement of commit access from +the PMC which will be sufficient to restore him or her to active committer +status.

+ +

Commit access can be revoked by a unanimous vote of all the active PMC members +(except the committer in question if he or she is also a PMC member).

+ +

All Apache committers are required to have a signed Contributor License +Agreement (CLA) on file with the +Apache Software Foundation. There is a Committer +FAQ which provides more details on +the requirements for committers.

+ +

A committer who makes a sustained contribution to the project may be invited +to become a member of the PMC. The form of contribution is not limited to +code. It can also include code review, helping out users on the mailing lists, +documentation, etc.

+ +

Project Management Committee

+ +

The PMC is responsible to the board and the ASF for the management and +oversight of the Apache Drill codebase. The responsibilities of the PMC +include

+ + + +

Membership of the PMC is by invitation only and must be approved by a lazy consensus of active PMC members. A PMC member is considered emeritus by his or her own declaration or by not contributing in any form to the project for over six months. An emeritus member may request reinstatement to the PMC, which will be sufficient to restore him or her to active PMC member status.

+ +

Membership of the PMC can be revoked by a unanimous vote of all the active PMC members other than the member in question.

+ +

The chair of the PMC is appointed by the ASF board. The chair is an office +holder of the Apache Software Foundation (Vice President, Apache Drill) and +has primary responsibility to the board for the management of the projects +within the scope of the Drill PMC. The chair reports to the board quarterly on +developments within the Drill project.

+ +

The term of the chair is one year. When the current chair's term is up or if +the chair resigns before the end of his or her term, the PMC votes to +recommend a new chair using lazy consensus, but the decision must be ratified +by the Apache board.

+ +

Decision Making

+ +

Within the Drill project, different types of decisions require different forms +of approval. For example, the previous section describes several decisions +which require 'lazy consensus' approval. This section defines how voting is +performed, the types of approvals, and which types of decision require which +type of approval.

+ +

Voting

+ +

Decisions regarding the project are made by votes on the primary project +development mailing list +dev@drill.apache.org. Where necessary, PMC +voting may take place on the private Drill PMC mailing list +private@drill.apache.org. Votes are clearly +indicated by subject line starting with [VOTE]. Votes may contain multiple +items for approval and these should be clearly separated. Voting is carried +out by replying to the vote mail. Voting may take four flavors.

+ +

Vote | Meaning
+1 | 'Yes,' 'Agree,' or 'the action should be performed.' In general, this vote also indicates a willingness on the behalf of the voter in 'making it happen'.
+0 | This vote indicates a willingness for the action under consideration to go ahead. The voter, however, will not be able to help.
-0 | This vote indicates that the voter does not, in general, agree with the proposed action but is not concerned enough to prevent the action going ahead.
-1 | This is a negative vote. On issues where consensus is required, this vote counts as a veto. All vetoes must contain an explanation of why the veto is appropriate. Vetoes with no explanation are void. It may also be appropriate for a -1 vote to include an alternative course of action.

+ +

All participants in the Drill project are encouraged to show their agreement +with or against a particular action by voting. For technical decisions, only +the votes of active committers are binding. Non binding votes are still useful +for those with binding votes to understand the perception of an action in the +wider Drill community. For PMC decisions, only the votes of PMC members are +binding.

+ +

Voting can also be applied to changes already made to the Drill codebase. +These typically take the form of a veto (-1) in reply to the commit message +sent when the commit is made. Note that this should be a rare occurrence. All +efforts should be made to discuss issues when they are still patches before +the code is committed.

+ +

Approvals

+ +

These are the types of approvals that can be sought. Different actions require +different types of approvals.

+ +

Approval Type | Description
Consensus | For this to pass, all voters with binding votes must vote and there can be no binding vetoes (-1). Consensus votes are rarely required due to the impracticality of getting all eligible voters to cast a vote.
Lazy Consensus | Lazy consensus requires 3 binding +1 votes and no binding vetoes.
Lazy Majority | A lazy majority vote requires 3 binding +1 votes and more binding +1 votes than -1 votes.
Lazy Approval | An action with lazy approval is implicitly allowed unless a -1 vote is received, at which time, depending on the type of action, either lazy majority or lazy consensus approval must be obtained.

+ + +

Vetoes

+ +

A valid, binding veto cannot be overruled. If a veto is cast, it must be +accompanied by a valid reason explaining the reasons for the veto. The +validity of a veto, if challenged, can be confirmed by anyone who has a +binding vote. This does not necessarily signify agreement with the veto - +merely that the veto is valid.

+ +

If you disagree with a valid veto, you must lobby the person casting the veto +to withdraw his or her veto. If a veto is not withdrawn, the action that has +been vetoed must be reversed in a timely manner.

+ +

Actions

+ +

This section describes the various actions which are undertaken within the +project, the corresponding approval required for that action and those who +have binding votes over the action. It also specifies the minimum length of +time that a vote must remain open, measured in business days. In general votes +should not be called at times when it is known that interested members of the +project will be unavailable.

+ +

Action | Description | Approval | Binding Votes | Minimum Length (business days)
Code Change | A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. | Consensus approval of active committers, with a minimum of one +1. The code can be committed after the first +1. | Active committers | 1
Release Plan | Defines the timetable and actions for a release. The plan also nominates a Release Manager. | Lazy majority | Active committers | 3
Product Release | When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project. | Lazy majority | Active PMC members | 3
Adoption of New Codebase | When the codebase for an existing, released product is to be replaced with an alternative codebase. If such a vote fails to gain approval, the existing codebase will continue. This also covers the creation of new sub-projects within the project. | 2/3 majority | Active PMC members | 6
New Committer | When a new committer is proposed for the project. | Lazy consensus | Active PMC members | 3
New PMC Member | When a committer is proposed for the PMC. | Lazy consensus | Active PMC members | 3
Committer Removal | When removal of commit privileges is sought. Note: Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the committer in question if a member of the PMC) | 6
PMC Member Removal | When removal of a PMC member is sought. Note: Such actions will also be referred to the ASF board by the PMC chair. | Consensus | Active PMC members (excluding the member in question) | 6
Modifying Bylaws | Modifying this document. | 2/3 majority | Active PMC members | 6

+
Added: drill/site/trunk/content/drill/docs/query-1-selecting-flat-data/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-1-selecting-flat-data/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-1-selecting-flat-data/index.html (added) +++ drill/site/trunk/content/drill/docs/query-1-selecting-flat-data/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,105 @@

Query 1: Selecting Flat Data

+ +
+ +

A very simple query against the donuts.json file returns the values for the +four "flat" columns (the columns that contain data at the top level only: no +nested data):

+
0: jdbc:drill:zk=local> select id, type, name, ppu
+from dfs.`/Users/brumsby/drill/donuts.json`;
++------------+------------+------------+------------+
+|     id     |    type    |    name    |    ppu     |
++------------+------------+------------+------------+
+| 0001       | donut      | Cake       | 0.55       |
++------------+------------+------------+------------+
+1 row selected (0.248 seconds)
+
+

Note that dfs is the schema name, the path to the file is enclosed by +backticks, and the query must end with a semicolon.

+
Added: drill/site/trunk/content/drill/docs/query-2-using-standard-sql-functions-clauses-and-joins/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-2-using-standard-sql-functions-clauses-and-joins/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-2-using-standard-sql-functions-clauses-and-joins/index.html (added) +++ drill/site/trunk/content/drill/docs/query-2-using-standard-sql-functions-clauses-and-joins/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,158 @@

Query 2: Using Standard SQL Functions, Clauses, and Joins

+ +
+ +

You can use standard SQL clauses, such as WHERE and ORDER BY, to elaborate on +this kind of simple query:

+
0: jdbc:drill:zk=local> select id, type from dfs.`/Users/brumsby/drill/donuts.json`
+where id>0
+order by id limit 1;
+
++------------+------------+
+
+|     id     |    type    |
+
++------------+------------+
+
+| 0001       | donut      |
+
++------------+------------+
+
+1 row selected (0.318 seconds)
+
+

You can also join files (or tables, or files and tables) by using standard +syntax:

+
0: jdbc:drill:zk=local> select tbl1.id, tbl1.type from dfs.`/Users/brumsby/drill/donuts.json` as tbl1
+join
+dfs.`/Users/brumsby/drill/moredonuts.json` as tbl2
+on tbl1.id=tbl2.id;
+
++------------+------------+
+
+|     id     |    type    |
+
++------------+------------+
+
+| 0001       | donut      |
+
++------------+------------+
+
+1 row selected (0.395 seconds)
+
+

Equivalent USING syntax and joins in the WHERE clause are also supported.
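For example, the following sketches express the same join with USING and as a join in the WHERE clause, reusing the sample files from the query above; the results should match the explicit JOIN ... ON version:

0: jdbc:drill:zk=local> select id, tbl1.type from dfs.`/Users/brumsby/drill/donuts.json` as tbl1
join dfs.`/Users/brumsby/drill/moredonuts.json` as tbl2
using (id);

0: jdbc:drill:zk=local> select tbl1.id, tbl1.type from dfs.`/Users/brumsby/drill/donuts.json` as tbl1,
dfs.`/Users/brumsby/drill/moredonuts.json` as tbl2
where tbl1.id = tbl2.id;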

+ +

Standard aggregate functions work against JSON data. For example:

+
0: jdbc:drill:zk=local> select type, avg(ppu) as ppu_sum from dfs.`/Users/brumsby/drill/donuts.json` group by type;
+
++------------+------------+
+
+|    type    |  ppu_sum   |
+
++------------+------------+
+
+| donut      | 0.55       |
+
++------------+------------+
+
+1 row selected (0.216 seconds)
+
+0: jdbc:drill:zk=local> select type, sum(sales) as sum_by_type from dfs.`/Users/brumsby/drill/moredonuts.json` group by type;
+
++------------+-------------+
+
+|    type    | sum_by_type |
+
++------------+-------------+
+
+| donut      | 1194        |
+
++------------+-------------+
+
+1 row selected (0.389 seconds)
+
Added: drill/site/trunk/content/drill/docs/query-3-selecting-nested-data-for-a-column/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-3-selecting-nested-data-for-a-column/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-3-selecting-nested-data-for-a-column/index.html (added) +++ drill/site/trunk/content/drill/docs/query-3-selecting-nested-data-for-a-column/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,135 @@

Query 3: Selecting Nested Data for a Column

+ +
+ +

The following queries show how to access the nested data inside the parts of +the record that are not flat (such as topping). To isolate and return nested +data, use the [n] notation, where n is a number that points to a specific +position in an array. Arrays use a 0-based index, so topping[3] points to +the fourth element in the array under topping, not the third.

+
0: jdbc:drill:zk=local> select topping[3] as top from dfs.`/Users/brumsby/drill/donuts.json`;
+
++------------+
+
+|    top     |
+
++------------+
+
+| {"id":"5007","type":"Powdered Sugar"} |
+
++------------+
+
+1 row selected (0.137 seconds)
+
+

Note that this query produces one column for all of the data that is nested +inside the topping segment of the file. The query as written does not unpack +the id and type name/value pairs. Also note the use of an alias for the +column name. (Without the alias, the default column name would be EXPR$0.)

+ +

Some JSON files store arrays within arrays. If your data has this +characteristic, you can probe into the inner array by using the following +notation: [n][n]

+ +

For example, assume that a segment of the JSON file looks like this:

+
...
+group:
+[
+  [1,2,3],
+
+  [4,5,6],
+
+  [7,8,9]
+]
+...
+
+

The following query would return 6 (the third value of the second inner +array).

+ +

select group[1][2]
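A complete form of that snippet might look like the following sketch; the nested.json file path is hypothetical, and the column name is escaped with backticks because group is a reserved word:

0: jdbc:drill:zk=local> select `group`[1][2] as inner_value
from dfs.`/Users/brumsby/drill/nested.json`;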

+
Added: drill/site/trunk/content/drill/docs/query-4-selecting-multiple-columns-within-nested-data/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-4-selecting-multiple-columns-within-nested-data/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-4-selecting-multiple-columns-within-nested-data/index.html (added) +++ drill/site/trunk/content/drill/docs/query-4-selecting-multiple-columns-within-nested-data/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,110 @@

Query 4: Selecting Multiple Columns Within Nested Data

+ +
+ +

The following query goes one step further to extract the JSON data, selecting +specific id and type data values as individual columns from inside the +topping array. This query is similar to the previous query, but it returns +the id and type values as separate columns.

+
0: jdbc:drill:zk=local> select tbl.topping[3].id as record, tbl.topping[3].type as first_topping
+from dfs.`/Users/brumsby/drill/donuts.json` as tbl;
++------------+---------------+
+|   record   | first_topping |
++------------+---------------+
+| 5007       | Powdered Sugar |
++------------+---------------+
+1 row selected (0.133 seconds)
+
+

This query also introduces a typical requirement for queries against nested +data: the use of a table alias (named tbl in this example). Without the table +alias, the query would return an error because the parser would assume that id +is a column inside a table named topping. As in all standard SQL queries, +select tbl.col means that tbl is the name of an existing table (at least for +the duration of the query) and col is a column that exists in that table.

+
Added: drill/site/trunk/content/drill/docs/query-data/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-data/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-data/index.html (added) +++ drill/site/trunk/content/drill/docs/query-data/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,149 @@

Query Data

+ +
+ +

You can query local and distributed file systems, Hive, and HBase data sources +registered with Drill. If you connected directly to a particular schema when +you invoked SQLLine, you can issue SQL queries against that schema. If you did +not indicate a schema when you invoked SQLLine, you can issue the USE +<schema> statement to run your queries against a particular schema. After you +issue the USE statement, you can use absolute notation, such as +schema.table.column.
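For example, the following sketch shows both approaches, borrowing the sample file name that appears in the example at the end of this page (the path is illustrative):

0: jdbc:drill:zk=local> use dfs;
0: jdbc:drill:zk=local> select * from `sample_data/my_sample.json` limit 5;
0: jdbc:drill:zk=local> select * from dfs.`sample_data/my_sample.json` limit 5;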

+ +

Click on any of the following links for information about various data source +queries and examples:

+ + + +

You may need to use casting functions in some queries. For example, you may +have to cast a string "100" to an integer in order to apply a math function +or an aggregate function.
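For example, a minimal sketch that assumes a hypothetical JSON file with a price column stored as a string:

0: jdbc:drill:zk=local> select sum(cast(price as int)) as total_price
from dfs.`sample_data/my_sample.json`;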

+ +

You can use the EXPLAIN command to analyze errors and troubleshoot queries +that do not run. For example, if you run into a casting error, the query plan +text may help you isolate the problem.

+
0: jdbc:drill:zk=local> !set maxwidth 10000
+0: jdbc:drill:zk=local> explain plan for select ... ;
+
+

The set command increases the default text display (number of characters). By +default, most of the plan output is hidden.
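For example, to inspect the plan for the hypothetical cast query sketched above:

0: jdbc:drill:zk=local> explain plan for select sum(cast(price as int)) as total_price from dfs.`sample_data/my_sample.json`;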

+ +

You may see errors if you try to use non-standard or unsupported SQL syntax in +a query.

+ +

Remember the following tips when querying data with Drill:

+ + + +

Example: SELECT * FROM dfs.default.`sample_data/my_sample.json`;

+ + +
Added: drill/site/trunk/content/drill/docs/query-stages/index.html URL: http://svn.apache.org/viewvc/drill/site/trunk/content/drill/docs/query-stages/index.html?rev=1651949&view=auto ============================================================================== --- drill/site/trunk/content/drill/docs/query-stages/index.html (added) +++ drill/site/trunk/content/drill/docs/query-stages/index.html Thu Jan 15 05:11:44 2015 @@ -0,0 +1,128 @@

Query Stages

+ +
+ +

Overview

+ +

Apache Drill is a system for interactive analysis of large-scale datasets. It +was designed to allow users to query across multiple large big data systems +using traditional query technologies such as SQL. It is built as a flexible +framework to support a wide variety of data operations, query languages and +storage engines.

+ +

Query Parsing

+ +

A Drillbit is capable of parsing a provided query into a logical plan. In +theory, Drill is capable of parsing a large range of query languages. At +launch, this will likely be restricted to an enhanced SQL2003 language.

+ +

Physical Planning

+ +

Once a query is parsed into a logical plan, a Drillbit will then translate the +plan into a physical plan. The physical plan will then be optimized for +performance. Since plan optimization can be computationally intensive, a +distributed in-memory cache will provide LRU retrieval of previously generated +optimized plans to speed query execution.

+ +

Execution Planning

+ +

Once a physical plan is generated, it is rendered into a set of detailed execution plan fragments (EPFs). This rendering is based on available resources, cluster load, query priority and detailed information about data distribution. In the case of large clusters, a subset of nodes will be responsible for rendering the EPFs. Shared state will be managed through the use of a distributed in-memory cache.

+ +

Execution Operation

+ +

Query execution starts with each Drillbit being provided with one or more EPFs +associated with query execution. A portion of these EPFs may be identified as +initial EPFs and thus they are executed immediately. Other EPFs are executed +as data flows into them.
