drill-user mailing list archives

From "Updike, Clark" <Clark.Upd...@jhuapl.edu>
Subject Re: NPE reading parquet files generated by Spark
Date Mon, 29 Jun 2020 20:58:02 GMT
I tried writing the Parquet files with spark.sql.parquet.writeLegacyFormat=true and got the
same error. I'm running out of ideas on this.

Just to show it's valid parquet:

parquet-tools cat /mnt/Drill/parqJsDf_0626v3/part-00000-9b54445b-7723-4e19-a145-e719e30da73f-c000.snappy.parquet | head -n 5
id = tag:search.twitter.com,2005:792893798200160257
objectType = activity
actor:
.objectType = person
.id = id:twitter.com:63936789

On 6/29/20, 9:35 AM, "Updike, Clark" <Clark.Updike@jhuapl.edu> wrote:

    I keep getting an NPE whenever I try to read Parquet files generated by Spark, using the
    Drill 1.18 nightly build (June 9).
    
    $ ls /mnt/Drill/parqJsDf_0625/dt\=2016-10-31/ | head -n 2
        part-00000-blah.snappy.parquet
        part-00001-blah.snappy.parquet
    
    No matter how I query it:
        apache drill> select * from dfs.`mnt_drill`.`parqJsDf_0625` where dir0='dt\=2016-10-31' limit 2;
        apache drill> select * from dfs.`mnt_drill`.`parqJsDf_0625` limit 2;
    
    I get an exception related to the partitioning:
    
    Caused By (java.lang.NullPointerException) null
        org.apache.drill.exec.store.parquet.ParquetGroupScanStatistics.checkForPartitionColumn():186
        org.apache.drill.exec.store.parquet.ParquetGroupScanStatistics.collect():119
        org.apache.drill.exec.store.parquet.ParquetGroupScanStatistics.<init>():59
        org.apache.drill.exec.store.parquet.BaseParquetMetadataProvider.getParquetGroupScanStatistics():293
        org.apache.drill.exec.store.parquet.BaseParquetMetadataProvider.getTableMetadata():249
        org.apache.drill.exec.store.parquet.BaseParquetMetadataProvider.initializeMetadata():203
        org.apache.drill.exec.store.parquet.BaseParquetMetadataProvider.init():170
        org.apache.drill.exec.metastore.store.parquet.ParquetTableMetadataProviderImpl.<init>():95
        org.apache.drill.exec.metastore.store.parquet.ParquetTableMetadataProviderImpl.<init>():48
        org.apache.drill.exec.metastore.store.parquet.ParquetTableMetadataProviderImpl$Builder.build():415
        org.apache.drill.exec.store.parquet.ParquetGroupScan.<init>():150
        org.apache.drill.exec.store.parquet.ParquetGroupScan.<init>():120
        org.apache.drill.exec.store.parquet.ParquetFormatPlugin.getGroupScan():202
        org.apache.drill.exec.store.parquet.ParquetFormatPlugin.getGroupScan():79
        org.apache.drill.exec.store.dfs.FileSystemPlugin.getPhysicalScan():226
        org.apache.drill.exec.store.dfs.FileSystemPlugin.getPhysicalScan():209
        org.apache.drill.exec.planner.logical.DrillTable.getGroupScan():119
        org.apache.drill.exec.planner.common.DrillScanRelBase.<init>():51
        org.apache.drill.exec.planner.logical.DrillScanRel.<init>():76
        org.apache.drill.exec.planner.logical.DrillScanRel.<init>():65
        org.apache.drill.exec.planner.logical.DrillScanRel.<init>():58
        org.apache.drill.exec.planner.logical.DrillScanRule.onMatch():38
        org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():208
        org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp():633
        org.apache.calcite.tools.Programs$RuleSetProgram.run():327
        org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():405
        org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():351
        org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel():245
        org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel():308
        org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():173
        org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():283
        org.apache.drill.exec.planner.sql.DrillSqlWorker.getPhysicalPlan():163
        org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan():128
        org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():93
        org.apache.drill.exec.work.foreman.Foreman.runSQL():593
    
    The files are valid Parquet: parquet-tools reads them just fine, and Spark reads the
    same files back in. I have tested with and without partitioning when writing from Spark,
    and with and without snappy compression. It is always the same NPE.
    Any insight appreciated...
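
For reference, the `dt=2016-10-31` directory in the listing above follows the `key=value` layout that Spark produces for partitioned writes; the backslash in the `ls` command is only shell escaping of `=`. The following is a minimal Python sketch, not anything Drill or Spark runs, of how such a path splits into partition pairs. The helper name `partition_pairs` is hypothetical, and the example path reuses the placeholder file name from the listing.

```python
from pathlib import PurePosixPath


def partition_pairs(path):
    """Return (key, value) pairs for every key=value component of a path.

    Hypothetical helper for illustration only; it is not part of Drill or Spark.
    """
    pairs = []
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        # Keep only components that actually contain '=' with a non-empty key.
        if sep and key:
            pairs.append((key, value))
    return pairs


# Path assembled from the directory listing in the message (file name is a placeholder).
example = "/mnt/Drill/parqJsDf_0625/dt=2016-10-31/part-00000-blah.snappy.parquet"
print(partition_pairs(example))
```

Drill itself exposes the first directory level under the queried table as `dir0`, holding the whole directory name rather than a parsed key and value.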
    
    
