drill-user mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: Storage Plugin Config for XML
Date Mon, 02 Mar 2015 17:36:20 GMT
Even beyond the issue of types, there are structures that are expressible
in XML that do not fit into a database model well, even one like Drill that
supports complex data. The primary issue is text stored between opening and
closing tags. I don't think these features of XML are commonly used by
systems that leverage XML for data storage, but consider the HTML document
below.

<html>
<body>
<p> This is a paragraph </p>
<p>
This is another paragraph, it has formatting like <b>bold</b>,
<i>italics</i>, and <em>emphasized</em> text in it.
</p>

</body>
</html>

This very simple example shows that there is information expressible in XML
that does not fit well in a database model, but could possibly be expected
to be query-able if we say that we support XML. If we want to represent
both paragraphs, we could put them in a repeated map at the schema position
'p'. To store the text we could put it in a field like 'innerHTML'. However,
if a user wanted to query the nested structure of their paragraphs, we
would need to keep track of the bold, italics and emphasis tags in the
nested structure as well. There is raw text that appears on both sides as
well as in-between all of the nested tags. Is the relative positioning of
these elements something we need to preserve as we read the data into
Drill? Currently nested fields are stored in maps defined to be unordered,
and even if we assigned order, we don't have a good way to represent the
way these elements are linked together with raw text.
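
To make the ordering problem concrete, here is a minimal sketch (not Drill
code) that walks the example document with Python's standard
xml.etree.ElementTree. The helper name ordered_pieces and the ("text", ...)
labelling are made up for illustration. It shows that the second paragraph is
really an ordered sequence of alternating raw text and child elements, which
an unordered map keyed by tag name cannot capture.

```python
import xml.etree.ElementTree as ET

# The example document from this message.
DOC = """<html>
<body>
<p> This is a paragraph </p>
<p>
This is another paragraph, it has formatting like <b>bold</b>,
<i>italics</i>, and <em>emphasized</em> text in it.
</p>
</body>
</html>"""


def ordered_pieces(elem):
    """Flatten one element's mixed content into (kind, value) pairs.

    Hypothetical helper: raw text is labelled "text", child elements are
    labelled with their tag name. Document order is preserved.
    """
    pieces = []
    # Text that appears before the first child element.
    if elem.text and elem.text.strip():
        pieces.append(("text", elem.text.strip()))
    for child in elem:
        # The child's own text content (its tail is not included here).
        pieces.append((child.tag, "".join(child.itertext()).strip()))
        # Raw text that follows the child's closing tag.
        if child.tail and child.tail.strip():
            pieces.append(("text", child.tail.strip()))
    return pieces


root = ET.fromstring(DOC)
paragraphs = [ordered_pieces(p) for p in root.iter("p")]

for pieces in paragraphs:
    print(pieces)
```

The first paragraph comes back as a single text piece, while the second comes
back as seven pieces that only make sense in sequence. Storing just the 'b',
'i' and 'em' values in a map would drop both the interleaved text and the
order.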

This is why I asked about schema definition systems. These are more rigid
and designed specifically for the task of object persistence, and not
markup. We need to look at the kind of data people are interested in
analyzing. If we start claiming to support XML, someone might throw a bunch
of XHTML documents at Drill and expect it to read all of the documents and
preserve complete information fidelity, which I think would require a lot
of work.

-Jason

On Sun, Mar 1, 2015 at 11:27 PM, Adam Gilmore <dragoncurve@gmail.com> wrote:

> I would imagine you'd have to read all XML as a string unless an XSD was
> provided, which would allow you to infer the types.  It would still be easy
> enough to cast to the types you need, similar to JSON in the all text mode.
>
> On Wed, Feb 25, 2015 at 5:41 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> >
> > To help with this, I just added some pretty ratty capability to log-synth
> > to generate XML data.
> >
> > You might try generating some data for the customer and then they can
> > point and say "more like this" and "less like that".
> >
> > This will give you pretty realistic sample data without security issues.
> >
> >
> >
> > On Wed, Feb 25, 2015 at 5:28 AM, Chad Smykay <csmykay@mapr.com> wrote:
> >
> >> I have a customer that will be using Drill in production for an ETL
> >> engine use case.  I will try to get example data, but the customer they
> >> are supporting is EXTREMELY sensitive about sharing any data, even sample
> >> data.  If I can find out what schemas they plan on using, it would be a
> >> good initial use case.  Thanks for the responses.
> >> --
> >> Kind Regards,
> >> Chad Smykay  |  Solutions Architect  |  M: 210.273.2344
> >>   mapr.com <http://www.mapr.com>
> >>
> >>
> >>
> >> Jacques Nadeau wrote:
> >>
> >> Not yet.  That being said, I think someone could make something pretty
> >> quick since there is an XML extension for Jackson that would plug nicely
> >> in next to our current json reader.
> >>
> >> On Mon, Feb 23, 2015 at 10:53 AM, Chad Smykay <csmykay@mapr.com> wrote:
> >>
> >>> Here is why I ask:
> >>>
> >>>
> >>> https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Contribution+Ideas
> >>>
> >>> States:
> >>> Support for new file format readers/writers
> >>> Currently Drill supports text, JSON and Parquet file formats natively
> >>> when interacting with file system. More readers/writers can be
> >>> introduced by implementing custom storage plugins. Example formats
> >>> include below.
> >>> AVRO
> >>> Sequence
> >>> RC
> >>> ORC
> >>> Protobuf
> >>> XML
> >>> Thrift
> >>> ....
> >>>
> >>> The words here being "custom storage plugins".  So I thought maybe
> >>> someone has already taken a crack at setting one up before I try.
> >>>
> >>>
> >>>
> >>>
> >>> Jahagirdar, Madhu wrote:
> >>>
> >>>  I am also interested in the storage plugin for XML.
> >>>
> >>>   From: Chad Smykay
> >>> Reply-To: "user@drill.apache.org", "csmykay@mapr.com"
> >>> Date: Tuesday, 24 February 2015 12:19 am
> >>> To: "user@drill.apache.org"
> >>> Subject: Storage Plugin Config for XML
> >>>
> >>>   Does anyone know of a working config for XML based files and
> >>> extensions?
> >>>
> >>>
> >>>
> >>>
> >>
> >
>
