drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...
Date Sat, 18 Feb 2017 02:30:28 GMT
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/594
  
    The bug here is fundamental to the way Drill works with JSON. We already had an extensive
discussion around this area in another PR. The problem is that JSON supports a null type which
is independent of all other types. In JSON, a null is not a "null int" or a "null string"
-- it is just null.
    
    Drill must infer a type for a field. This leads to all kinds of grief when a file contains
a run of nulls before the real value:
    
    {code}
    { "id": 1, "b": null }
    ...
    { "id": 80000, "b": "gee, I'm a string!" }
    {code}
    
    Drill must do something with the leading values. "b" is a null... what? Int? String?
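    
    To make the problem concrete, here is a minimal sketch in Python (not Drill code). It
assumes a reader that must settle on a column type once per batch; `infer_type` and
`batch_size` are hypothetical names, and the batch size is picked only for illustration.

```python
import json

# Hypothetical stand-in for a reader that must pick a column type per batch.
# Not Drill code; infer_type and batch_size are illustrative names.
def infer_type(values):
    """Return the type name of the first non-null value, or None."""
    for v in values:
        if v is not None:
            return type(v).__name__
    return None  # every value was null: JSON gives no hint at all

# 79,999 records with b = null, then one record where b is a string.
records = [json.loads('{"id": %d, "b": null}' % i) for i in range(1, 80000)]
records.append(json.loads('{"id": 80000, "b": "gee, I\'m a string!"}'))

batch_size = 4096  # assumed batch size, chosen only for illustration
first_batch = [r["b"] for r in records[:batch_size]]
last_batch = [r["b"] for r in records[-batch_size:]]

print(infer_type(first_batch))  # None
print(infer_type(last_batch))   # str
```

The first batch ends with no type information at all, which is the case the rest of this
comment is about.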
    
    We've had many bugs in this area. These are not just code bugs; they reflect a basic
incompatibility between Drill and JSON.
    
    This fix is yet another attempt to work around the limitation, but it cannot overcome the
basic incompatibility.
    
    What we are doing, it seems, is building a list of fields that have seen only null values,
deferring action on those fields until later. That works fine if "later" occurs in the same
record batch. It is not clear what happens if we get to the end of the batch (as in the example
above), but have never seen the type of the field: what type of vector do we create?
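    
    As I read it, the deferral works roughly like the Python sketch below. This is my own
simplified model, not the code in the PR: `BatchBuilder`, `null_only` and `unresolved` are
hypothetical names, and the sketch assumes every row carries every field.

```python
# Hypothetical sketch of per-batch deferral; not Drill's actual reader code.
# Assumes every row carries every field, so back-filled nulls line up with rows.
class BatchBuilder:
    def __init__(self):
        self.columns = {}    # field name -> (type name, list of values)
        self.null_only = {}  # field name -> count of nulls seen so far
        self.row_count = 0

    def add_row(self, row):
        for name, value in row.items():
            if name in self.columns:
                self.columns[name][1].append(value)
            elif value is None:
                # No type yet: just remember how many nulls we owe this field.
                self.null_only[name] = self.null_only.get(name, 0) + 1
            else:
                # First real value: create the column and back-fill the nulls.
                pending = self.null_only.pop(name, 0)
                self.columns[name] = (type(value).__name__,
                                      [None] * pending + [value])
        self.row_count += 1

    def unresolved(self):
        """Fields that reached end of batch without ever showing a type."""
        return sorted(self.null_only)


resolved = BatchBuilder()
resolved.add_row({"id": 1, "b": None})
resolved.add_row({"id": 2, "b": "x"})
print(resolved.columns["b"])   # ('str', [None, 'x'])
print(resolved.unresolved())   # []

stuck = BatchBuilder()
stuck.add_row({"id": 1, "b": None})
print(stuck.unresolved())      # ['b']
```

The `stuck` case is the open question: at end of batch, "b" still has no type, and the
sketch has no vector to emit for it.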
    
    There are several solutions.
    
    One is to have a "null" type in Drill. When we see the initial run of nulls, we simply
create a field of the "null" type. We have type conversion rules that say that a "null" vector
can be coerced into any other type when we ultimately see the type. (And, if we don't see
a type in one batch, we can pass the null vector along downstream for later reconciliation.)
This is a big change; too big for a bug fix.
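    
    A rough Python model of the idea, to show how small the coercion rule itself is. The
names `NullVector`, `TypedVector` and `coerce` are hypothetical; none of them correspond to
existing Drill classes, and the real work would be in the operators that have to accept such
a vector.

```python
# Hypothetical model of a "null" vector type; names are illustrative only.
class NullVector:
    """A column that records only how many nulls it holds."""
    def __init__(self, count=0):
        self.count = count

    def append_null(self):
        self.count += 1


class TypedVector:
    def __init__(self, type_name, values=None):
        self.type_name = type_name
        self.values = list(values or [])


def coerce(vector, type_name):
    """Apply the rule 'null coerces to anything'; refuse other conversions."""
    if isinstance(vector, NullVector):
        return TypedVector(type_name, [None] * vector.count)
    if vector.type_name == type_name:
        return vector
    raise TypeError(f"cannot coerce {vector.type_name} to {type_name}")


# A run of nulls, then the first real value reveals the type.
nulls = NullVector()
for _ in range(3):
    nulls.append_null()
strings = coerce(nulls, "varchar")
strings.values.append("gee, I'm a string!")
print(strings.type_name, len(strings.values))  # varchar 4
```

Because the null vector stores only a count, it is cheap to carry from batch to batch until
some later batch supplies the type.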
    
    Another solution, used here, is to keep track of "null only" fields, to defer the decision
for later. That has a performance impact.
    
    A third solution is to go ahead and create a vector of any type, keep setting its values
to null (as if we had already seen the field type), but be ready to discard that vector and
convert it to the proper type once we see that type. In this way, we treat null fields just
as any other up to the point where we realize we have a type conflict. Only then do we check
the "null only" map and decide we can quietly convert the vector type to the proper type.
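    
    A Python sketch of this third approach, under the assumption that the placeholder type
is a nullable int (any nullable type would do). `OptimisticWriter`, `Column` and
`DEFAULT_TYPE` are hypothetical names, not Drill classes.

```python
# Hypothetical sketch of "guess a type, swap it later"; names are illustrative.
DEFAULT_TYPE = "int"  # assumed placeholder type; any nullable type would do


class Column:
    def __init__(self, type_name):
        self.type_name = type_name
        self.values = []


class OptimisticWriter:
    def __init__(self):
        self.columns = {}
        self.null_only = set()  # fields whose column holds nothing but nulls

    def write(self, name, value):
        col = self.columns.get(name)
        if col is None:
            if value is None:
                # Guess a type so the normal write path can be used.
                col = Column(DEFAULT_TYPE)
                self.null_only.add(name)
            else:
                col = Column(type(value).__name__)
            self.columns[name] = col
        elif value is not None and type(value).__name__ != col.type_name:
            # Type conflict: only now consult the null-only set.
            if name not in self.null_only:
                raise TypeError(
                    f"{name}: {col.type_name} vs {type(value).__name__}")
            # The old column held only nulls, so it can be replaced quietly.
            replacement = Column(type(value).__name__)
            replacement.values = [None] * len(col.values)
            col = self.columns[name] = replacement
        if value is not None:
            self.null_only.discard(name)
        col.values.append(value)


w = OptimisticWriter()
w.write("b", None)
w.write("b", None)
print(w.columns["b"].type_name)  # int (the guess)
w.write("b", "gee, I'm a string!")
print(w.columns["b"].type_name)  # str (quietly replaced)
```

The null-only set is touched when a field first appears, when it gets its first real value,
and on a conflict, so the common write path costs no more than it does today.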
    
    These are the initial thoughts. I'll add more nuanced comments as I review the code in
more detail.

