drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleksandr Kalinin <alexk...@gmail.com>
Subject DRILL-5797: new Parquet reader and ParquetSchema
Date Mon, 07 May 2018 09:46:38 GMT

Recently I resumed work on DRILL-5797 (enforcing usage of new parquet
reader on filtered non-complex columns), where submitted PR contains case
sensitivity issue (https://github.com/apache/drill/pull/976).

>From what I understood from original PR author (Damien Profeta), discussed
/ recommended way to solve it was to extract creation of ParquetSchema from
ParquetRecordReader into ParquetScanBatchCreator and passing it over to
ParquetRecordReader e.g. as constructor parameter. Indeed, ParquetSchema
takes case internally of case-insensitive schema handling.

I re-implemented the change that way, but hitting regression on skip
queries (select 'x' as mycol from ....). Skip queries fail because they
trigger empty list of selected columns in ParquetScanBatchCreator, and
creating ParquetSchema with empty column list causes failure later on in
execution. On the other hand, currently that list gets modified in
ParquetRecordReader internally in AbstractRecordReader.setColumns() by call
to getDefaultColumnsToRead() which makes it work.

Above issue can be solved by forcing schema rebuild in ParquetRecordReader
for skip queries to ensure inclusion of default column to selected columns,
which doesn't look nice though. Basically, regression seems to highlight
tight coupling of ParquetSchema creation to ParquetRecordReader. On top of
that, ParquetSchema.buildSchema() is taking batch size as parameter (always
default one), so I already had to add getter for external access in order
to call ParquetSchema.buildSchema() outside.

These small issues seem to signal that extracting ParquetSchema creation
outside of ParquetRecordReader is may be not the right way to go.

One idea to overcome this could be adding checker method to ParquetSchema
which would not create the full schema but would only perform validation
work of query suitability for the new reader. Alternatively, original PR
approach could be simply enhanced without using ParquetSchema at all (even
though some of the added code would probably duplicate what ParquetSchema
does). I didn't think about concrete ways yet, but both methods look doable.

Are there are opinions/suggestions on above?

Best Regards,

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message