drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [drill] paul-rogers commented on a change in pull request #2282: DRILL-7978: Fixed Width Format Plugin
Date Fri, 20 Aug 2021 00:06:16 GMT

paul-rogers commented on a change in pull request #2282:
URL: https://github.com/apache/drill/pull/2282#discussion_r692565008



##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import org.apache.drill.common.types.TypeProtos;
+
+@JsonTypeName("fixedwidthReaderFieldDescription")
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+public class FixedwidthFieldConfig {
+
+  private final TypeProtos.MinorType dataType;
+  private final String fieldName;
+  private final String dateTimeFormat;
+  private final int startIndex;
+  private final int fieldWidth;
+
+  public FixedwidthFieldConfig(@JsonProperty("dataType") TypeProtos.MinorType dataType,

Review comment:
       Does it work to use the `MinorType` here? Is that type set up for Jackson serialization?
I don't know the answer, just noting we should double-check to ensure it works OK.

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import org.apache.drill.common.types.TypeProtos;
+
+@JsonTypeName("fixedwidthReaderFieldDescription")
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+public class FixedwidthFieldConfig {
+
+  private final TypeProtos.MinorType dataType;
+  private final String fieldName;
+  private final String dateTimeFormat;
+  private final int startIndex;
+  private final int fieldWidth;
+
+  public FixedwidthFieldConfig(@JsonProperty("dataType") TypeProtos.MinorType dataType,
+                               @JsonProperty("fieldName") String fieldName,
+                               @JsonProperty("dateTimeFormat") String dateTimeFormat,
+                               @JsonProperty("startIndex") int startIndex,
+                               @JsonProperty("fieldWidth") int fieldWidth) {
+    this.dataType = dataType;
+    this.fieldName = fieldName;
+    this.dateTimeFormat = dateTimeFormat;
+    this.startIndex = startIndex;
+    this.fieldWidth = fieldWidth;

Review comment:
       Since configs are created by hand, having good defaults is helpful. Perhaps:
   
   * `name`: required; throw an exception if blank, or if the stripped name is not a valid
SQL symbol.
   * `type`: default to `VARCHAR`
   * `dateTimeFormat`: `null` is allowed, so no default.
   * `index`: required, must be >= 0.
   * `width`: either required, or can be optional. If provided must be > 0. (See below.)
   
   For this plugin, we also have to check the set of fields.
   
   * No two fields can have the same name.
   * Should fields be allowed to overlap?
   
   We could be clever. Scan all fields and sort into ascending order. If a field omits the
width, just compute it from the index of this and the next field.
   

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.LocalDateTime;
+import java.time.LocalTime;
+import java.time.ZoneId;
+import java.time.ZonedDateTime;
+import java.time.format.DateTimeFormatter;
+import java.util.Locale;
+
+public class FixedwidthBatchReader implements ManagedReader<FileSchemaNegotiator> {

Review comment:
       This uses "EVF V1". Your plugin provides schema information, and is thus a perfect
fit for "EVF V2" which can use your schema information to set up the row set loader schema
automatically for you.

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFieldConfig.java
##########
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import org.apache.drill.common.types.TypeProtos;
+
+@JsonTypeName("fixedwidthReaderFieldDescription")
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+public class FixedwidthFieldConfig {
+
+  private final TypeProtos.MinorType dataType;
+  private final String fieldName;
+  private final String dateTimeFormat;
+  private final int startIndex;
+  private final int fieldWidth;
+
+  public FixedwidthFieldConfig(@JsonProperty("dataType") TypeProtos.MinorType dataType,
+                               @JsonProperty("fieldName") String fieldName,
+                               @JsonProperty("dateTimeFormat") String dateTimeFormat,
+                               @JsonProperty("startIndex") int startIndex,
+                               @JsonProperty("fieldWidth") int fieldWidth) {

Review comment:
       Drill has no config builder UI, users have to create JSON configs by hand. We've found
it is helpful to keep field names short (Go-style.) So, perhaps:
   
   * `dataType` &rarr; `type`
   * `fieldName` &rarr; `name`
   * `dateTimeFormat` long, but OK
   * `startIndex` &rarr; `index`
   * `fieldWidth` &rarr; `width`
   
   Note that, since this is a `FieldConfig` we don't need the `field` prefix.
   
   Also, it can help readers if the "primary key" fields are first, so perhaps change the
order to
   
   * `name`
   * `index`
   * `width`
   * `type`
   * `dateTimeFormat`

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatPlugin.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import org.apache.drill.common.logical.StoragePluginConfig;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.server.DrillbitContext;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.hadoop.conf.Configuration;
+
+
+public class FixedwidthFormatPlugin extends EasyFormatPlugin<FixedwidthFormatConfig>
{
+
+  protected static final String DEFAULT_NAME = "fixedwidth";
+
+  private static class FixedwidthReaderFactory extends FileReaderFactory {
+
+    private final FixedwidthFormatConfig config;
+    private final int maxRecords;
+
+    public FixedwidthReaderFactory(FixedwidthFormatConfig config, int maxRecords) {
+      this.config = config;
+      this.maxRecords = maxRecords;
+    }
+
+    @Override
+    public ManagedReader<? extends FileSchemaNegotiator> newReader() {
+      return new FixedwidthBatchReader(config, maxRecords);
+    }
+  }
+
+  public FixedwidthFormatPlugin(String name,
+                                DrillbitContext context,
+                                Configuration fsConf,
+                                StoragePluginConfig storageConfig,
+                                FixedwidthFormatConfig formatConfig) {
+    super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig);
+  }
+
+  private static EasyFormatConfig easyConfig(Configuration fsConf, FixedwidthFormatConfig
pluginConfig) {
+    return EasyFormatConfig.builder()
+      .readable(true)
+      .writable(false)
+      .blockSplittable(false)

Review comment:
       I'm pretty sure fix-width files are splittable. If every records resides on a single
line, they the file is spittable if we add code that, if the start offset !=0, scan to the
next newline. And, on read, read rows until the file position is greater than the block end.
See the text file (CSV) plugin for details (though, don't follow its implementation as that
implementation is rather unique to that one use case.)

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthFormatConfig.java
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonInclude;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import org.apache.drill.common.PlanStringBuilder;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Objects;
+
+@JsonTypeName(FixedwidthFormatPlugin.DEFAULT_NAME)
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+public class FixedwidthFormatConfig implements FormatPluginConfig {
+  private final List<String> extensions;
+  private final List<FixedwidthFieldConfig> fields;
+
+  @JsonCreator
+  public FixedwidthFormatConfig(@JsonProperty("extensions") List<String> extensions,
+                                @JsonProperty("fields") List<FixedwidthFieldConfig>
fields) {
+    this.extensions = extensions == null ? Collections.singletonList("fwf") : ImmutableList.copyOf(extensions);
+    this.fields = fields;
+  }
+
+  @JsonInclude(JsonInclude.Include.NON_DEFAULT)
+  public List<String> getExtensions() {
+    return extensions;
+  }
+
+  public List<FixedwidthFieldConfig> getFields() {
+    return fields;
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(extensions, fields);
+  }
+
+  @Override
+  public boolean equals(Object obj) {
+    if (this == obj) {
+      return true;
+    }
+    if (obj == null || getClass() != obj.getClass()) {
+      return false;
+    }
+    FixedwidthFormatConfig other = (FixedwidthFormatConfig) obj;
+    return Objects.equals(extensions, other.extensions)
+            && Objects.equals(fields, other.fields);
+  }
+
+  @Override
+  public String toString() {
+    return new PlanStringBuilder(this)
+            .field("extensions", extensions)
+            .field("fields", fields)
+            .toString();
+  }
+}

Review comment:
       Nit: GitHub is complaining about a lack of final newlines.

##########
File path: contrib/format-fixedwidth/src/main/java/org/apache/drill/exec/store/fixedwidth/FixedwidthBatchReader.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.fixedwidth;
+
+import org.apache.drill.common.exceptions.CustomErrorContext;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
+import org.apache.drill.exec.physical.resultSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.LocalDateTime;
+import java.time.LocalTime;
+import java.time.ZoneId;
+import java.time.ZonedDateTime;
+import java.time.format.DateTimeFormatter;
+import java.util.Locale;
+
+public class FixedwidthBatchReader implements ManagedReader<FileSchemaNegotiator> {
+
+  private static final Logger logger = LoggerFactory.getLogger(FixedwidthBatchReader.class);
+  private FileSplit split;
+  private final int maxRecords;
+  private final FixedwidthFormatConfig config;
+  private CustomErrorContext errorContext;
+  private InputStream fsStream;
+  private ResultSetLoader loader;
+  private RowSetLoader writer;
+  private BufferedReader reader;
+  private int lineNum;
+
+  public FixedwidthBatchReader(FixedwidthFormatConfig config, int maxRecords) {
+    this.config = config;
+    this.maxRecords = maxRecords;
+  }
+
+  @Override
+  public boolean open(FileSchemaNegotiator negotiator) {
+    split = negotiator.split();
+    errorContext = negotiator.parentErrorContext();
+    lineNum = 0;
+    try {
+      fsStream = negotiator.fileSystem().openPossiblyCompressedStream(split.getPath());
+      negotiator.tableSchema(buildSchema(), true);
+      loader = negotiator.build();
+    } catch (Exception e) {
+      throw UserException
+        .dataReadError(e)
+        .message("Failed to open input file: {}", split.getPath().toString())
+        .addContext(errorContext)
+        .addContext(e.getMessage())
+        .build(logger);
+    }
+    reader = new BufferedReader(new InputStreamReader(fsStream, Charsets.UTF_8));
+    return true;
+  }
+
+  @Override
+  public boolean next() { // Use loader to read data from file to turn into Drill rows
+    String line;
+    RowSetLoader writer = loader.writer();
+
+    try {
+      line = reader.readLine();
+      while (!writer.isFull() && line != null) {
+        writer.start();
+        parseLine(line, writer);
+        writer.save();
+        line = reader.readLine();
+        lineNum++;
+      }
+    } catch (IOException e) {
+      throw UserException
+        .dataReadError(e)
+        .message("Failed to read input file: {}", split.getPath().toString())
+        .addContext(errorContext)
+        .addContext(e.getMessage())
+        .addContext("Line Number", lineNum)
+        .build(logger);
+    }
+    return writer.limitReached(maxRecords);  // returns false when maxRecords limit has been
reached
+  }
+
+  @Override
+  public void close() {
+    try {
+      fsStream.close();
+      loader.close();
+    } catch (Exception e) {
+      throw UserException
+        .dataReadError(e)
+        .message("Failed to close input file: {}", split.getPath().toString())
+        .addContext(errorContext)
+        .addContext(e.getMessage())
+        .build(logger);
+    }
+  }
+
+  private TupleMetadata buildSchema() {
+    SchemaBuilder builder = new SchemaBuilder();
+    for (FixedwidthFieldConfig field : config.getFields()) {
+      builder.addNullable(field.getFieldName(), field.getDataType());
+    }
+    return builder.buildSchema();
+  }
+
+
+  private boolean parseLine(String line, RowSetLoader writer) throws IOException {
+    int i = 0;
+    TypeProtos.MinorType dataType;
+    String dateTimeFormat;
+    String value;
+    for (FixedwidthFieldConfig field : config.getFields()) {
+      value = line.substring(field.getStartIndex() - 1, field.getStartIndex() + field.getFieldWidth()
- 1);
+      dataType = field.getDataType();
+      dateTimeFormat = field.getDateTimeFormat();
+      DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateTimeFormat, Locale.ENGLISH);
+      try {
+        switch (dataType) {

Review comment:
       This is OK, but slow because of the switch. There is a set of field converter classes
which can handle the string-to-whatever conversions. With that, there is a direct call per
field (inner loop) from reading the field to convert to write into value vectors. The field-specific-type
switch is done only once, at setup time.
   
   These converters are used in the CSV reader when a schema is provided. I can dig up more
examples if helpful.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message