drill-dev mailing list archives

From "Jason Altekruse (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-649) Unable to read dictionary encoded parquet file generated from impala or avro
Date Wed, 07 May 2014 02:41:23 GMT

    [ https://issues.apache.org/jira/browse/DRILL-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991481#comment-13991481 ]

Jason Altekruse commented on DRILL-649:

I took a look at the file. I had only implemented dictionary encoding for varchar fields.
I had seen that they added dictionaries for other types, but I thought the varchar dictionaries
would be the ones blocking us from reading Impala-generated files. They currently use ints
to index into the dictionary, which makes a dictionary of floats or ints seem
useless, but with a cap on dictionary sizes of around 50,000 they can still save some space by
bit packing the dictionary keys so that each one is stored in fewer than 4 bytes (we will have
to read each 'int' into memory, bit mask it to zero-fill the bits above the packed value,
and then use the result to look up the value in the dictionary).
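
To make the read path above concrete, here is a minimal sketch of unpacking the keys and
looking them up. It is not the Drill reader code. The class and method names are hypothetical,
the keys are assumed to be packed least-significant-bit first at a fixed bit width, and the
50,000 cap is read as a number of dictionary entries, which is also an assumption.

```java
// Hypothetical sketch, not the Drill implementation.
final class DictionaryLookupSketch {

    // Smallest number of bits that can represent every index in [0, dictionarySize).
    public static int bitWidth(int dictionarySize) {
        return dictionarySize <= 1 ? 0 : 32 - Integer.numberOfLeadingZeros(dictionarySize - 1);
    }

    // Reads the i-th key from a buffer of keys packed LSB-first, `width` bits each (0..32).
    public static int readPackedIndex(byte[] packed, int i, int width) {
        long bitOffset = (long) i * width;
        int byteOffset = (int) (bitOffset >>> 3);
        int shift = (int) (bitOffset & 7);
        long window = 0;
        // Gather up to 5 bytes so that a 32-bit key starting mid-byte still fits.
        for (int b = 0; b < 5 && byteOffset + b < packed.length; b++) {
            window |= (packed[byteOffset + b] & 0xFFL) << (8 * b);
        }
        // The mask zero-fills every bit above the packed key.
        long mask = (1L << width) - 1;
        return (int) ((window >>> shift) & mask);
    }

    // Materializes `count` values one at a time; this is the per-value work
    // that rules out a bulk vector copy.
    public static float[] decode(byte[] packed, int count, int width, float[] dictionary) {
        float[] out = new float[count];
        for (int i = 0; i < count; i++) {
            out[i] = dictionary[readPackedIndex(packed, i, width)];
        }
        return out;
    }
}
```

Under these assumptions a dictionary of 50,000 entries needs 16-bit keys, so each key takes
2 bytes instead of 4.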

This is going to kill our read performance on the other types, because we have to materialize
everything at read time and can no longer use vector copies, but I'll get together a fix for
it before the end of the week to allow us at least to read the files. I'll try to get the
dictionaries off heap for performance, but I will focus first on just getting it working.
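
As a rough illustration of what keeping a dictionary off heap could look like, here is a
sketch that stores float entries in a direct java.nio.ByteBuffer. It is only a stand-in. The
class name is hypothetical, and an actual fix would presumably use Drill's own buffer
allocation rather than a raw direct buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a float dictionary whose entries live outside the Java heap.
final class OffHeapFloatDictionary {
    private final ByteBuffer values; // direct buffer, not subject to heap GC copying
    private final int size;

    OffHeapFloatDictionary(float[] entries) {
        size = entries.length;
        values = ByteBuffer.allocateDirect(size * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < size; i++) {
            // Absolute put: entry i lives at byte offset i * 4.
            values.putFloat(i * Float.BYTES, entries[i]);
        }
    }

    // Looks up the value for an unpacked dictionary key.
    float get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("dictionary index " + index);
        }
        return values.getFloat(index * Float.BYTES);
    }

    int size() {
        return size;
    }

    boolean isOffHeap() {
        return values.isDirect();
    }
}
```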

> Unable to read dictionary encoded parquet file generated from impala or avro
> ----------------------------------------------------------------------------
>                 Key: DRILL-649
>                 URL: https://issues.apache.org/jira/browse/DRILL-649
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Steven Phillips
>            Assignee: Jason Altekruse
>         Attachments: nation.parquet
> Support for dictionary encoding was recently added, but it looks like some dictionary-encoded
> files are still unreadable by Drill. For example, the Parquet file created from an
> Avro file attached to DRILL-389 still fails.
> I also created a simple Parquet file from Impala, which also fails to read.
> I will attach the file.

This message was sent by Atlassian JIRA
