drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Use cases for DFDL
Date Thu, 07 Nov 2019 18:35:02 GMT
Hi All,

One thought to add is that if DFDL defines the file schema, then it would be ideal to use
that schema at plan time as well as run time. Drill's Calcite integration provides a means to
do this, though I am personally a bit hazy on the details.

Certainly getting the reader to work is the first step; thanks Charles for the excellent summary.
Then, add the needed Calcite integration to make the schema available to the planner at plan
time.

Thanks,
- Paul

 

On Thursday, November 7, 2019, 09:58:53 AM PST, Charles Givre <cgivre@gmail.com> wrote:
 
 Hi Steve, 
Thanks for responding... Here's how Drill reads a file:

Drill uses what are called "format plugins" which basically read the file in question and
map fields to column vectors.  Note:  Drill supports nested data structures, so a column
could contain a MAP or LIST. 

The basic steps are:
1.  Open the input stream and read the file.
2.  If the schema is known, it is advantageous to define the schema using a SchemaBuilder
object in advance and create writers for each column.  In this case, since we'd be
using DFDL, we do know the schema, so we could create the schema BEFORE the data actually gets
read.  If the schema is not known in advance (JSON, for instance), Drill can discover the schema
as it is reading the data, by dynamically adding column vectors as data is ingested, but that's
not the case here.
3.  Once the schema is defined, Drill will then read the file row by row, parse the data,
and assign values to each column vector.

There are a few more details but that's the essence.  
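
To make the three steps above concrete, here is a minimal, self-contained sketch. It does
not use Drill's API: the class name, the ColType enum, and the use of plain Lists in place of
column vectors are all assumptions made for illustration only. It shows the schema-first
pattern: create one column per known field, then read rows and append parsed values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaFirstReaderSketch {

    /** Hypothetical column types; a real reader would use Drill's own type system. */
    enum ColType { INT, VARCHAR }

    /**
     * Step 2: the schema is known up front, so create one (initially empty)
     * column per field before any data is read. A plain List stands in for
     * a Drill column vector here.
     */
    static Map<String, List<Object>> createVectors(String[] names) {
        Map<String, List<Object>> vectors = new LinkedHashMap<>();
        for (String name : names) {
            vectors.put(name, new ArrayList<>());
        }
        return vectors;
    }

    /** Step 3: read row by row, parse each field, and append it to its column. */
    static Map<String, List<Object>> read(Iterable<String> rows, String[] names, ColType[] types) {
        Map<String, List<Object>> vectors = createVectors(names);
        for (String row : rows) {
            String[] fields = row.split(",", -1);
            for (int i = 0; i < names.length; i++) {
                String raw = fields[i].trim();
                // Parse according to the declared type; anything non-INT stays text.
                Object value = (types[i] == ColType.INT) ? (Object) Integer.valueOf(raw) : raw;
                vectors.get(names[i]).add(value);
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Step 1 is simulated: the "file" is just two comma-separated rows.
        Map<String, List<Object>> cols = read(
            Arrays.asList("1, alpha", "2, beta"),
            new String[] {"id", "name"},
            new ColType[] {ColType.INT, ColType.VARCHAR});
        System.out.println(cols);
    }
}
```

In a real format plugin the Lists would be replaced by Drill's column writers, and the
rows would come from the parsed file rather than from strings.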

What would be great is if we could create a function that maps a DFDL schema
directly to a Drill SchemaBuilder. (Docs here [1])  Drill does natively support JSON; however,
it would probably be more effective and efficient if there were a custom InfosetOutputter
for Drill.  Ideally, we need some sort of Iterable object so that Drill can map the parsed
fields to the schema.
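
As a rough illustration of what such a schema-mapping function might look like, here is a
self-contained sketch. Everything in it is hypothetical: the Element class is a simplified
stand-in for a DFDL/XSD element declaration (it is not Daffodil's API), and the output is a
descriptive string rather than a real Drill SchemaBuilder. The type names (INT, BIGINT,
FLOAT8, VARCHAR, MAP) follow Drill's naming, but the particular XSD-to-Drill mapping is an
assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DfdlToColumnsSketch {

    /** Hypothetical, simplified schema element: simple (has a type) or complex (has children). */
    static final class Element {
        final String name;
        final String simpleType;      // e.g. "xs:int"; null for complex elements
        final List<Element> children;

        Element(String name, String simpleType, Element... children) {
            this.name = name;
            this.simpleType = simpleType;
            this.children = Arrays.asList(children);
        }
    }

    /** Maps an XSD simple type name to an illustrative column type name. */
    static String columnType(String xsdType) {
        switch (xsdType) {
            case "xs:int":    return "INT";
            case "xs:long":   return "BIGINT";
            case "xs:double": return "FLOAT8";
            default:          return "VARCHAR";   // fall back to text for anything unmapped
        }
    }

    /** Recursively renders an element as "name TYPE" or "name MAP<...>". */
    static String describe(Element e) {
        if (e.simpleType != null) {
            return e.name + " " + columnType(e.simpleType);
        }
        List<String> parts = new ArrayList<>();
        for (Element child : e.children) {
            parts.add(describe(child));
        }
        return e.name + " MAP<" + String.join(", ", parts) + ">";
    }

    public static void main(String[] args) {
        Element record = new Element("record", null,
            new Element("id", "xs:int"),
            new Element("position", null,
                new Element("lat", "xs:double"),
                new Element("lon", "xs:double")));
        System.out.println(describe(record));
    }
}
```

The recursion is the point here: complex elements become MAPs and simple elements become
scalar columns, which is the shape a real mapping onto SchemaBuilder would likely need too.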

If you want to see a relatively simple format plugin, take a look here: [2]. This
file is the BatchReader, which is where most of the heavy lifting takes place.  This plugin
is for ESRI Shape files and has a mix of pre-defined fields, nested fields and fields that
are defined after reading starts.


[1]: https://github.com/apache/drill/blob/9c62bf1a91f611bdefa6f3a99e9dfbdf9b622413/docs/dev/RowSetFramework.md
[2]: https://github.com/apache/drill/blob/master/contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java


I can start a draft PR on the Drill side over the weekend and will share the link with this
list.
Respectfully, 
-- C


> On Nov 5, 2019, at 8:12 AM, Steve Lawrence <stephen.d.lawrence@gmail.com> wrote:
> 
> I definitely agree. Apache Drill seems like a logical place to add
> Daffodil support. And I'm sure many of us, including myself, would be
> happy to provide some time towards this effort.
> 
> The Daffodil API is actually fairly simple and is usually fairly
> straightforward to integrate--most of the complexity comes from the DFDL
> schemas. There's a good "hello world" available [1] that shows more API
> functionality/errors/etc., but the gist of it is:
> 
> 1) Compile a DFDL schema to a data processor:
> 
>  Compiler c = Daffodil.compiler();
>  ProcessorFactory pf = c.compileFile(file);
>  DataProcessor dp = pf.onPath("/");
> 
> 2) Create an input source for the data
> 
>  InputStream is = ...
>  InputSourceDataInputStream in = new InputSourceDataInputStream(is);
> 
> 3) Create an infoset outputter (we have a handful of different kinds)
> 
>  JDOMInfosetOutputter out = new JDOMInfosetOutputter();
> 
> 4) Use the DataProcessor to parse the input data to the infoset outputter
> 
>  ParseResult pr = dp.parse(in, out);
> 
> So I guess the part where we need more Drill understanding is what
> the InfosetOutputter (step 3) needs to look like to better integrate
> into Drill. Is there a standard data structure that Drill expects
> representations of data to look like, with Drill doing the querying on that
> data structure? And is there some sort of schema that Daffodil would
> need to create to describe what this structure looks like so Drill could
> query it? Perhaps we'd have a custom Drill InfosetOutputter that creates
> this data structure, unless Drill already supports XML or JSON.
> 
> Or is it completely up to the Storage Plugin (is that the right term) to
> determine how to take a Drill query and find the appropriate data from
> the data store?
> 
> - Steve
> 
> [1]
> https://github.com/OpenDFDL/examples/blob/master/helloWorld/src/main/java/HelloWorld.java
> 
> 
> On 11/3/19 9:31 AM, Charles Givre wrote:
>> Hi Julian,
>> It seems like there is a beginning of convergence of the minds here.  I went to
>> the Apache Roadshow in DC and that was where I learned about DFDL and
>> immediately thought this was a really interesting possibility.
>>
>> I'd love to see if we could foster some collaboration between the various
>> projects on this.  From the Drill side of things, it would make it SO much
>> easier to get Drill to read (and by extension query) various data types.  I'd be
>> willing to contribute time from the Drill side, but I definitely will need help
>> understanding how DFDL works.
>> 
>> --C
>> 
>> 
>> 
>>> On Nov 3, 2019, at 8:01 AM, Julian Feinauer <j.feinauer@pragmaticminds.de> wrote:
>>>
>>> Hi Charles,
>>> this is an interesting idea and in fact we also discussed the same matter for
>>> Calcite at ApacheCon NA.
>>> But, I agree that it would be really powerful together with a complete Runtime
>>> like Drill.
>>> Julian
>>> *From:* Charles Givre <cgivre@gmail.com>
>>> *Reply-To:* "users@daffodil.apache.org" <users@daffodil.apache.org>
>>> *Date:* Wednesday, October 30, 2019 at 19:38
>>> *To:* "Costello, Roger L." <costello@mitre.org>
>>> *Cc:* "users@daffodil.apache.org" <users@daffodil.apache.org>
>>> *Subject:* Re: Use cases for DFDL
>>> +1
>>> 
>>> 
>>>> On Oct 30, 2019, at 2:36 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>> Excellent! Okay, here’s the use case:
>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could
>>>> use ANSI SQL to query the data, join it with other data, do analysis, etc.,
>>>> just as if it came from a database. So, instead of parsing data to XML and
>>>> then using XPath to pull out data, you could instead parse data to Apache
>>>> Drill's data representation and then use ANSI SQL to pull out data, and even
>>>> combine it with other non-Daffodil data types. The advantage for this would
>>>> be that it would make it very easy to enable Drill to query new data types
>>>> (i.e., simply by using a DFDL schema) and it would enable users to easily query
>>>> this data without having to load it into another system.
>>>> How’s that Charles?
>>>> /Roger
>>>> *From:* Charles Givre <cgivre@gmail.com>
>>>> *Sent:* Wednesday, October 30, 2019 2:28 PM
>>>> *To:* Costello, Roger L. <costello@mitre.org>
>>>> *Cc:* users@daffodil.apache.org
>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>> Close... One minor nit is that Drill doesn't use a "query-like" syntax. It is
>>>> regular ANSI SQL.  IMHO, I think this would be a really great collaboration
>>>> of the two communities.
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Oct 30, 2019, at 1:10 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>> Thanks again Charles. Is the following use case description correct?
>>>>> A Daffodil extension could be created for Apache Drill so that you could
>>>>> parse any kind of data with Daffodil using a DFDL schema, and then you could
>>>>> use Apache Drill's query-like syntax and rich capabilities to query parts of
>>>>> that data, join it with other data, do analysis, etc., just as if it came
>>>>> from a database. So, instead of parsing data to XML and then using XPath to
>>>>> pull out data, you could instead parse data to Apache Drill's data
>>>>> representation and then use Drill's rich data-query capabilities to pull out
>>>>> data, and even combine it with other non-Daffodil data types. The advantage
>>>>> for this would be that it would make it very easy to enable Drill to query
>>>>> new data types (i.e., simply by using a DFDL schema) and it would enable users
>>>>> to easily query this data without having to load it into another system.
>>>>> Is that correct?
>>>>> /Roger
>>>>> *From:* Charles Givre <cgivre@gmail.com>
>>>>> *Sent:* Wednesday, October 30, 2019 12:19 PM
>>>>> *To:* Costello, Roger L. <costello@mitre.org>
>>>>> *Cc:* users@daffodil.apache.org
>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>> Not exactly...
>>>>> I was thinking of using DFDL to enable Drill to create a schema for data
>>>>> that Drill cannot read.  If DFDL can be used to describe the schema, a
>>>>> plugin could be written for Drill that mirrors this schema and ultimately
>>>>> reads the data files.  Drill wouldn't be populating any database, but rather
>>>>> directly querying the data.
>>>>> The advantage for this would be that it would make it very easy to enable
>>>>> Drill to query new data types (i.e., simply by using a DFDL schema) and it
>>>>> would enable users to easily query this data w/o having to load it into
>>>>> another system.  Does that make sense?
>>>>> -- C
>>>>>> On Oct 30, 2019, at 12:13 PM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>>> Thanks Charles. Let me see if I understand the use case correctly.
>>>>>> Use DFDL to parse data to populate a database and then use Apache Drill to
>>>>>> query the database.
>>>>>> Is that correct?
>>>>>> /Roger
>>>>>> *From:* Charles Givre <cgivre@gmail.com>
>>>>>> *Sent:* Wednesday, October 30, 2019 12:01 PM
>>>>>> *To:* users@daffodil.apache.org
>>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>>> To add to this discussion, I'm the PMC chair for Apache Drill. I think a
>>>>>> compelling use case for DFDL would be enabling Drill to query data based
>>>>>> on a DFDL schema.  This same concept could be applied to other SQL query
>>>>>> engines such as Presto and/or Impala.
>>>>>> IMHO, this would facilitate the analysis of data sets supported by DFDL.
>>>>>> -- C
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Oct 30, 2019, at 11:53 AM, Costello, Roger L. <costello@mitre.org> wrote:
>>>>>>> Thanks Mike! I updated the slide:
>>>>>>> <image002.png>
>>>>>>> *From:* Beckerle, Mike <mbeckerle@tresys.com>
>>>>>>> *Sent:* Wednesday, October 30, 2019 11:45 AM
>>>>>>> *To:* users@daffodil.apache.org
>>>>>>> *Subject:* [EXT] Re: Use cases for DFDL
>>>>>>> I would not pick on RDF data stores as the target.
>>>>>>> Parsing data to populate a database (any variety) is the actual case. The
>>>>>>> fact that we did do one project involving RDF is why I cited that example
>>>>>>> in particular but pulling data into any data store/data base begins with
>>>>>>> the ability to parse the data, and then process it into suitable form.
>>>>>>> This is an incomplete list so perhaps this slide title should be "Example
>>>>>>> Use Cases for DFDL" ?
>>>>>>> ...mikeb
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Costello, Roger L. <costello@mitre.org>
>>>>>>> *Sent:* Monday, October 28, 2019 10:41 AM
>>>>>>> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
>>>>>>> *Subject:* Use cases for DFDL
>>>>>>> Hi Folks,
>>>>>>> I created a slide of use cases. See below. Do you agree with the slide?
>>>>>>> Anything you would add, delete, or change?  /Roger
>>>>>>> <image003.png>
>> 
> 
  