drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aman Sinha <asi...@maprtech.com>
Subject Re: Estimating cost for fs tables, table names after joins and executing logical plans
Date Sat, 23 May 2015 17:31:18 GMT
correction: 4 (b) query should have a non-equality condition instead of
   select n1.n_nationkey from nation n1, (select min(n_regionkey) as
min_regionkey from nation n) n2
               where n1.n_regionkey < n2.min_regionkey;

On Sat, May 23, 2015 at 10:28 AM, Aman Sinha <asinha@maprtech.com> wrote:

> 1.  Yes, certainly the cost estimation is applied for file system sources
> as well.  For Parquet the row count is
>       extracted from the metadata available in the files.  For text (csv,
> json), this is estimated based on the size
>       of the files and a configurable
> *store.text.estimated_row_size_bytes* parameter.   The default value is
> 100
>       bytes per row, but you can change it.
> 2.  There are 2 things in your query that need special handling in Drill
> because it does not rely on a centralized
>       schema.
>       First, the presence of '*' wildcard in the SELECT list - the '*' is
> expanded at run-time unlike the schema-based
>      systems where it is expanded at the initial query planning time.
> Second, whenever there is a join between 2
>       or more tables, the columns in the query need to be qualified by the
> table name because a column  could
>       potentially be present in either of the tables.
>       The following query which explicitly specifies the qualified columns
> in SELECT clause should work:
>        select r_regionkey from (
>            select r.r_regionkey, n.n_regionkey from region r join nation n
> on r.r_regionkey = n.n_regionkey)
>  3.  You might want to file a JIRA with the exact reproducible test case.
>  4.   I believe the link you sent for the plan document is quite old.  The
> best place for the latest documentation is
>        drill.apache.org .
>        Non-equality joins are supported in 2 situations:
>         a)  if there is another equality join between the same 2 tables.
>  e.g
>            select n1.n_nationkey from nation n1, nation n2
>                where n1.n_nationkey = n2.n_nationkey and n1.n_regionkey <
> n2.n_regionkey
>         b) if one side of the join is a scalar subquery:
>             select n1.n_nationkey from nation n1, (select min(n_regionkey)
> as min_regionkey from nation n) n2
>                where n1.n_regionkey = n2.min_regionkey;
>          Pure cartesian joins are currently not supported.
> Aman
> On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <pietraswithin@gmail.com>
> wrote:
>> Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
>> questions/issues:
>> 1. For query cost estimation one usually needs some additional
>> information about a table such as the number of rows. Is the cost
>> estimation implemented for fs sources as well? If yes, how is the metadata
>> extracted and cached? From my understanding some formats like parquet store
>> it in the file footer, but what about json or csv files? Can this
>> information be queried/retrieved somehow by the user?
>> 2. I’ve been working with the following query:
>> $q = select * from region join nation on region.R_REGIONKEY =
>> nation.N_REGIONKEY;
>> where region and nation are the sample data files imported into a dfs.tmp
>> schema.
>> running queries like
>> select R_REGIONKEY from ($q);
>> results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries
>> like select R_REGIONKEY from (SELECT * FROM region); work fine, as well as
>> saving the result of the join with CREATE TABLE and then replacing $q with
>> the saved table’s name. Why is that and what are the rules for renaming
>> columns in join queries?
>> 3. I’ve been trying to execute a logical plan using the web interface. It
>> works fine with a simple scan - project query, but when trying to use the
>> output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
>> the following error:
>> SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
>> property-based creators: already had [constructor for
>> org.apache.drill.common.logical.data.Join,  ...
>> the whole logical query and full error message are at
>> https://gist.github.com/pyetras/bf625b6697de62284996
>> 4. What are the supported conditions for joins? The sql interface seems
>> to support only (e1 == e2 [AND])*, but the logical operator reference at
>> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
>> mentions other relations and also cartesian joins. Are those simply not
>> implemented for the sql parser or not supported in Drill at all?
>> Sorry for the long read and thanks for your assistance,
>> --
>> Piotr Sokólski

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message