drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aman Sinha <amansi...@apache.org>
Subject Re: [EXT] Re: Food for thought about intra-document operation
Date Wed, 29 Nov 2017 17:36:21 GMT
Damien,
for the intra-document operations, it would be useful to add support for
LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN)
should address the use case you have.  I have filed a JIRA for this:
https://issues.apache.org/jira/browse/DRILL-5999.

-Aman

On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta <damien.profeta@amadeus.com
> wrote:

> Hello Aman,
>
> AsterixDb seems to follow the standard SQL with a few minor modifications
> and add functions to ease aggregations (array_count, array_avg…)
>
> That would tend to confirm at least that the support of unnest is a good
> idea to improve Drill.
>
> Best regards
>
> Damien
>
> **
>
> On 09/25/2017 07:53 PM, Aman Sinha wrote:
>
>> Damien,
>> thanks for initiating the discussion..indeed this would be a very useful
>> enhancement.  Currently, Drill provides repeated_contains()  for filtering
>> and repeated_count() for count aggregates on arrays but not the general
>> purpose intra-document operations that you need based on your example.
>> I haven't gone through all the alternatives but in addition to what you
>> have described,  you might also want to look at SQL++ (
>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has
>> been
>> adopted by AsterixDB and has syntax extensions to SQL for unstructured
>> data.
>>
>> -Aman
>>
>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <
>> damien.profeta@amadeus.com>
>> wrote:
>>
>> Hello,
>>>
>>> A few format handled by Drill enable to work with document, meaning
>>> nested
>>> and repeated structure instead of just tables. Json and Parquet are the
>>> two
>>> that come to my mind right now. Document modeling is a great way to
>>> express
>>> complex object and is used a lot in my company. Drill is able to handle
>>> them but unfortunately, it cannot make much computation on it. By
>>> computation I mean, filtering branches of the document, computing
>>> statistics (avg, min, max) on part of the document … That would be very
>>> useful as an analytic tools.
>>>
>>> _What can be done_
>>>
>>> The question then is how to express the computation we want to do on the
>>> document. I have found multiple ways to handle that and I don't really
>>> know
>>> which one is the best hence the mail to expose what I have found to
>>> initiate discussion, maybe.
>>>
>>> First, in we look back at the Dremel paper which is the base of the
>>> parquet format and also one of the example for drill, dremel is adding
>>> the
>>> special keyword "WITHIN" to SQL to specify that the computation has to be
>>> done within a document. What is very powerful with this keyword is that
>>> it
>>> allows you to generate document and doesn't force you to flatten
>>> everything. You can find exemple of it usage in the google successor of
>>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
>>> ery/docs/legacy-nested-repeated.
>>>
>>> But it seems that it was problematic for Google, because they now propose
>>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle
>>> such
>>> computation. I am not familiar with SQL 2011 but it is told in BigQuery
>>> documentation to integrated the keywords for nested and repeated
>>> structure.
>>> You can have a view about how this is done in BigQuery here:
>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
>>> Basically, what I have seen is that they leverage UNNEST and ARRAY
>>> keyword
>>> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>>>
>>> In Impala, they have added a way to add a subquery on a complex type in
>>> such a way that the subquery only act intra-document. I have no idea if
>>> this is standard SQL or not. In page https://www.cloudera.com/docum
>>> entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
>>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
>>> example.
>>>
>>> In Presto, you can apply lambda function to map/array to transform the
>>> structure and apply filter on it. So you have filter, map_filter function
>>> to filter array and map respectively. (cf https://prestodb.io/docs/curre
>>> nt/functions/lambda.html#filter)
>>>
>>> _Example_
>>>
>>> If I want to make a short example, let’s say we have a flight with a
>>> group
>>> of passengers in it. A document would be :
>>>
>>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M
>>> ”},{“age”:15,”gender”:”F”},
>>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>>>
>>> The database would be millions of such document and I want to know the
>>> average age of the male passenger for every flight.
>>>
>>> In Dremel, the query would be something like: select flightnb,
>>> avg(male_age) within record from (select groups.age as male_age from
>>> flight
>>> where group.gender = "M")
>>>
>>> With sql, it would be something like: select flightnb, avg(male_age) from
>>> (array(select g.age as male_age from unnest(group)as g where g.gender =
>>> "M") as male_age)
>>>
>>> With impala it would be something like: select flightnb, avg(male) from
>>> flight, select g.age from groups as g where g.gender = “M” as male
>>>
>>> With presto, it would be something like:  select flightnb, avg(male) from
>>> flight, filter(group,x->x.gender = "M")as male
>>>
>>> I am not sure at all about my SQL queries but it should give you a rough
>>> idea about the different ways to express the inital query.
>>>
>>> So many different ways to express the same query… I would personally go
>>> for the SQL way of expressing things to implement it in Drill, especially
>>> because calcite is already able to parse unnest, array, but that’s only
>>> my
>>> first thought.
>>>
>>> Best regards,
>>>
>>> Damien
>>>
>>>
>>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message