From "Thaddeus Diamond (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-389) Nested Parquet data generated from Hive does not work
Date Sat, 01 Mar 2014 04:42:19 GMT
Thaddeus Diamond created DRILL-389:
--------------------------------------

             Summary: Nested Parquet data generated from Hive does not work
                 Key: DRILL-389
                 URL: https://issues.apache.org/jira/browse/DRILL-389
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.0.0-milestone-1
         Environment: CentOS 6.3
CDH 4.6 installed by Cloudera Manager Free Edition
Hive 0.10.0
            Reporter: Thaddeus Diamond
         Attachments: avro_test.db, nobench.ddl, nobench_1.avsc

Inside Hive, I generated Parquet data from Avro data as follows.  Using the attached Avro data
file (avro_test.db) and the attached nested Avro schema (nobench_1.avsc), I created a Hive
table:

{noformat}
CREATE TABLE avro_nobench_hdfs
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hdfs/avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hdfs/nobench.avsc');
{noformat}
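
As a sanity check that Hive can read the Avro data through this table, a trivial query (shown
here for illustration) should return a row:

{noformat}
SELECT * FROM avro_nobench_hdfs LIMIT 1;
{noformat}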

Note that this schema is loosely based on the NoBench benchmark for JSON data proposed by
Craig Chasseur (http://pages.cs.wisc.edu/~chasseur/).

To create a Parquet-backed Hive table you need to spell out the full column list.  The attached
schema is very large, so I extracted the columns with:

{noformat}
sudo -u hdfs hive -e 'describe avro_nobench_hdfs' > /tmp/temp.sql
{noformat}
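
A nested Avro record surfaces in Hive as a struct column, so a line of that output looks
roughly like this (a hypothetical column; the real ones come from nobench_1.avsc):

{noformat}
nested_obj    struct<str1:string,num1:bigint>    from deserializer
{noformat}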

Then I replaced each "from deserializer" annotation in that output with a comma and wrapped the
resulting column list in the following DDL (a scripted version of the replacement is sketched
after the block):

{noformat}
CREATE TABLE avro_nobench_parquet (
    -- ... COLUMNS HERE
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
{noformat}
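
For reference, that replacement can be scripted; a rough sketch, assuming the {{describe}}
output is tab-separated (the trailing comma on the last column still has to be removed by hand):

{noformat}
# turn "col_name<TAB>type<TAB>from deserializer" into "col_name<TAB>type,"
sed 's/\tfrom deserializer$/,/' /tmp/temp.sql > /tmp/columns.sql
{noformat}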

Finally, I generated the actual Parquet binary data using {{INSERT OVERWRITE}}:

{noformat}
INSERT OVERWRITE TABLE avro_nobench_parquet SELECT * FROM avro_nobench_hdfs;
{noformat}

This completed successfully.  I then validated the data with:

{noformat}
SELECT COUNT(*) FROM avro_nobench_parquet;
SELECT * FROM avro_nobench_parquet LIMIT 1;
{noformat}

If you look in {{hdfs:///user/hive/warehouse/avro_nobench_parquet}} you'll see a single raw
data file (named something like {{0000_0}}).  Copy it to the local filesystem:

{noformat}
sudo -u hdfs hdfs dfs -copyToLocal /user/hive/warehouse/avro_nobench_parquet/* .
{noformat}
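
Independently of Drill, the downloaded file can be inspected with parquet-tools to confirm that
the nested schema actually made it into the Parquet footer.  A sketch, assuming a self-contained
parquet-tools jar is available (the exact jar name and invocation vary by build):

{noformat}
# print the Parquet schema stored in the file footer
java -jar parquet-tools-<version>.jar schema 0000_0
{noformat}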

Then, in Drill (with the file renamed to {{nobench.parquet}} to match the query) I ran:

{noformat}
SELECT COUNT(*) FROM "nobench.parquet";
{noformat}

I got the following error:

{noformat}
Caused by: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.[error_id:
"a13783d0-d9da-4639-8809-ba4a5ac54e04"
endpoint {
  address: "ip-10-101-1-82.ec2.internal"
  user_port: 31010
  bit_port: 32011
}
error_type: 0
message: "Failure while running fragment. < NullPointerException"
]
        at org.apache.drill.exec.rpc.user.QueryResultHandler.batchArrived(QueryResultHandler.java:72)
        at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:79)
        at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:48)
        at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:33)
        at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:142)
        at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:127)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:173)
        at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
        at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:100)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:465)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:359)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
        at java.lang.Thread.run(Thread.java:744)
{noformat}

The second time I ran the query, I got an OutOfMemoryError:

{noformat}
Exception in thread "WorkManager-3" java.lang.OutOfMemoryError: Java heap space
        at org.apache.drill.exec.store.parquet.PageReadStatus.<init>(PageReadStatus.java:41)
        at org.apache.drill.exec.store.parquet.ColumnReader.<init>(ColumnReader.java:70)
        at org.apache.drill.exec.store.parquet.VarLenBinaryReader$NullableVarLengthColumn.<init>(VarLenBinaryReader.java:62)
        at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:167)
        at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:99)
        at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:60)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:103)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:63)
        at org.apache.drill.exec.store.parquet.ParquetRowGroupScan.accept(ParquetRowGroupScan.java:107)
        at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:90)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:63)
        at org.apache.drill.exec.physical.config.Project.accept(Project.java:51)
        at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:121)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:63)
        at org.apache.drill.exec.physical.config.Sort.accept(Sort.java:58)
        at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:151)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:63)
        at org.apache.drill.exec.physical.config.StreamingAggregate.accept(StreamingAggregate.java:59)
        at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:132)
        at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:63)
        at org.apache.drill.exec.physical.config.Screen.accept(Screen.java:102)
        at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:180)
        at org.apache.drill.exec.work.foreman.RunningFragmentManager.runFragments(RunningFragmentManager.java:84)
        at org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan(Foreman.java:228)
        at org.apache.drill.exec.work.foreman.Foreman.parseAndRunLogicalPlan(Foreman.java:176)
        at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:153)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
{noformat}


