drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [drill] cgivre commented on pull request #2283: DRILL-7979: Self-Closing XML Tags Cause Schema Change Exceptions
Date Mon, 02 Aug 2021 15:29:15 GMT

cgivre commented on pull request #2283:
URL: https://github.com/apache/drill/pull/2283#issuecomment-891119762


   > I started out adding specific, implementation-level comments but I've paused that
to back off and ask: is this really a _self-closing tag_ thing, or is the situation the same
for _any empty element_ that also occurs as a parent element? In my tests on `master`. the
problem is the same for either of the following, which I believe are also equivalent in the
XML spec.
   > 
   > ```
   > <!-- self-closing -->
   > <foo/>
   > 
   > <!-- just empty -->
   > <foo></foo>
   > ```
   > 
   > If I've got right end of the stick here then I suggest that we adjust all the naming
to refer to the "empty element" case, rather than the "self-closing" case.
   > 
   > Next, following on from our comments on Jira and the idea of using maps for this case,
what do you think of the following approach?
   > 
   > 1. When our first encounter with an element `foo` is empty, and therefore ambiguous
in terms of type, we default to the non-leaf case and make it a map.
   > 2. For subsequent parent `foo` elements we return populated maps.  For subsequent
empty `foo` elements we return empty maps.
   > 3. For subsequent leaf elements `<foo>bar</foo>`, which we would normally
map to varchar but where we find that we've already got a map from step 1, we put the element
value into the map under a hardcoded special key, e.g. `{ '__value__': 'bar' }`.
   > 
   > The above will also work in the case when the first element encountered is empty but
has attributes `<foo a='b' />` while the element discarding logic in the present patch
does not discard such elements. If you're not crazy about this it's no problem and I've probably
got a couple more specific remarks to add on the implementation.
   
   @dzamo Thanks for the response.  The real issue is that we don't know the schema as we're
scanning the file, so we have to do the best we can.  The issue is that with the empty fields
(self-closing or otherwise) we don't really know what they are until we see real data.  For
instance, if we decide to make them an empty map, we'll get an error if the next record shows
up as a scalar.  The current approach was to treat empty fields as scalars which then causes
issues if we encounter a map in the next row.
   You asked in an other comment about perhaps treating all empty elements in the same manner.
 There was a specific challenge as to how the self closing tags which is why I made this PR.
 I'm actually working on another project to get the XML reader to download a provided schema
(the XSD link) which would actually solve a lot of issues reading XML.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message