drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...
Date Thu, 05 Jan 2017 17:41:57 GMT
Github user paul-rogers commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
    @@ -305,12 +306,23 @@ protected boolean setupNewSchema() throws SchemaChangeException
         final NamedExpression flattenExpr = new NamedExpression(popConfig.getColumn(), new
         final ValueVectorReadExpression vectorRead = (ValueVectorReadExpression)ExpressionTreeMaterializer.materialize(flattenExpr.getExpr(), incoming, collector, context.getFunctionRegistry(), true);
    -    final TransferPair tp = getFlattenFieldTransferPair(flattenExpr.getRef());
    -    if (tp != null) {
    -      transfers.add(tp);
    -      container.add(tp.getTo());
    -      transferFieldIds.add(vectorRead.getFieldId().getFieldIds()[0]);
    +    final FieldReference fieldReference = flattenExpr.getRef();
    +    final TransferPair transferPair = getFlattenFieldTransferPair(fieldReference);
    +    if (transferPair != null) {
    +      final ValueVector flattenVector = transferPair.getTo();
    +      // checks that list has only default ValueVector and replaces resulting ValueVector to INT typed ValueVector
    +      if (exprs.size() == 0 && flattenVector.getField().getType().equals(Types.LATE_BIND_TYPE))
    +        final MaterializedField outputField = MaterializedField.create(fieldReference.getAsNamePart().getName(),
    +        final ValueVector vector = TypeHelper.getNewVector(outputField, oContext.getAllocator());
    --- End diff --
    The fix appears to be to transform an empty list into an empty list of integers. That
is, Drill does not have the concept of "empty list", only "empty list of type X" and we are
guessing the type to be integer.
    We've had issues elsewhere in the product where such guesses turn out to be wrong. Perhaps
the next row/batch has a non-empty list, but of strings. Or worse, of objects (maps.) Downstream
operators cannot handle this.
    The result is that a query fails for no better reason than we caused it to fail by guessing
the wrong type.
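
    To make the failure mode concrete, here is a minimal sketch of how the INT guess
turns into a schema change. It is a toy model, not Drill code: `ElemType`,
`GuessedListSchema`, `acceptEmpty` and `accept` are hypothetical names invented for
illustration only.

```java
// Hypothetical, simplified model of a downstream operator's view of the
// flattened column. None of these names exist in Drill.
enum ElemType { INT, VARCHAR, MAP }

final class GuessedListSchema {
    private ElemType current; // null until the first batch arrives

    // An empty list carries no element type, so this mirrors the guess: INT.
    boolean acceptEmpty() {
        return accept(ElemType.INT);
    }

    // Returns true when this batch forces a schema change downstream.
    boolean accept(ElemType observed) {
        boolean changed = current != null && current != observed;
        current = observed;
        return changed;
    }
}
```

    In this model, calling `acceptEmpty()` for the first batch and then
`accept(ElemType.VARCHAR)` for the second reports a schema change, even though
the data itself never contained an integer.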
    Clearly, fixing the broader problem is beyond the scope of this fix. I am pointing out,
however, that a consequence of the assumption made here is that some queries, somewhere later,
will fail due to an artificial schema change.
    The correct solution is to introduce an "Unknown" type and mark this as a vector of type
"Unknown". All we know is that it is a list; the member types are unknown. Then, in downstream
operators, when we encounter a schema change, we know that an empty list of "Unknown" type
is compatible with a list of any other type (say, maps).
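
    As a rough sketch of the compatibility rule proposed here (again a toy model, not
Drill's type system; `ListElemType` and `ListTypeMerger` are hypothetical names, and
returning null to signal a real conflict is an assumption made only for brevity):

```java
// Hypothetical element types for a list vector; UNKNOWN marks a list whose
// members have not been seen yet. These names are illustrative, not Drill's.
enum ListElemType { UNKNOWN, INT, VARCHAR, MAP }

final class ListTypeMerger {
    // Returns the element type both lists can share, or null when the two
    // types genuinely conflict and a schema change must be reported.
    static ListElemType merge(ListElemType a, ListElemType b) {
        if (a == ListElemType.UNKNOWN) {
            return b; // an empty "Unknown" list adopts the other type
        }
        if (b == ListElemType.UNKNOWN) {
            return a;
        }
        return a == b ? a : null;
    }
}
```

    Under this rule, merging UNKNOWN with MAP yields MAP without any schema change,
while merging INT with VARCHAR still reports a conflict, as it should.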
