drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Parquet Files Issue Summary
Date Wed, 01 Jun 2016 15:36:38 GMT
I know I have a few threads going here on my trials and tribulations, but
wanted to wrap a summary up here on what I am seeing and where I am with
support.  First of all, thanks to all who have been pointing me in the
right directions on things, it's greatly appreciated.

So a quick summary: I have some Parquet files in directories by day,
created on a Cloudera cluster running parquet-mr 1.5-cdh. We are using
Snappy compression, dictionary encoding, and Parquet format version 1_0.
You can see the sizes of three days of data below.

As also shown below, I am running this through a view: because string
columns show up as binary in Drill, I use convert_from(field, 'UTF8') in
the view to get the proper strings. My goal was to take this data and use
CTAS to create Drill-written Parquet files for optimal performance.
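
For concreteness, the view and CTAS look roughly like this (a sketch only;
the workspace names and paths are placeholders, and only a couple of the
real column names are shown):

CREATE OR REPLACE VIEW dfs.tmp.`events_view` AS
SELECT
  CONVERT_FROM(threat_name, 'UTF8') AS threat_name,
  parent_observation_event_id
  -- ...plus the rest of the 103 columns...
FROM dfs.`/data/events`;

CREATE TABLE dfs.tmp.`events_drill` AS
SELECT * FROM dfs.tmp.`events_view`;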

Problem 1: The array-index-out-of-bounds error happened on a particular
field. This did not happen in Impala on the exact same files. (See the
"Reading and converting Parquet files intended for Impala" thread.)

Problem 2: When experimenting, I found that I could set the new Parquet
reader option and the CTAS would work. That said, that setting did add 155
seconds to the one day that was working. That's a lot of time added.
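
For reference, the session option involved there (assuming it was the new
Parquet reader option; typed from memory, so double-check the name against
the docs page linked below):

ALTER SESSION SET `store.parquet.use_new_reader` = true;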

Problem 3: All methods of CTAS (with or without the reader) created much
larger files than the MapReduce job did. My guess is the lack of dictionary
encoding.

Problem 4: When I enabled dictionary encoding, the array-out-of-bounds
issue still existed for the days with the troubled data, but the CTAS did
eventually work on the other day. The query took a LONG time, but made
files that were similar in size to the originals.
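
For reference, the dictionary-encoding option I toggled for that test
(again from memory; it is one of the two "For internal use" options
discussed below):

ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;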

Problem 5: When I tried to use the new reader and the dictionary settings
together, I could put my cluster in a nasty state; it appears one of the
drillbits had a SIGSEGV (see below). I have more information there, but
this is interesting, because instead of failing the query, it just hung
everything. (Theory: I have supervision on my drillbits. Could it be that
the timeout that would fail everything out if a drillbit went down was
never actually reached, because as soon as the drillbit crashed my
supervision restarted it, and that's what put the cluster into a bad state?
This is something to explore...)

So that's a number of problems listed here; these are only the "unresolved"
ones. (Via Paul, we identified that my GC logging wasn't actually happening
due to a bug in the Drill startup scripts, and I implemented a workaround
there, so that one is resolved.)

So, MapR Support pointed out to me that, for the hung-cluster issue, my use
of the two variables isn't actually supported per
https://drill.apache.org/docs/configuration-options-introduction/. That
leaves my array-out-of-bounds issue still open, because my only solution
there was not supported.

Question 1: On the dictionary encoding, isn't this a standard part of
Parquet? Why doesn't Drill support it? If it's planned, what is the
timeline for allowing a "supported" use of this feature (vs. "For internal
use. Do not change.")?

Question 2: A similar question, though this one is less related to the
standard Parquet project: is there a timeline/roadmap for planned (or not)
support of the new reader?

Question 3: I am working to get more data about the Parquet files using
parquet-tools; what other approaches may I take here?
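
The sort of commands I mean (assuming the parquet-tools CLI that ships with
parquet-mr; the file path is just a placeholder):

parquet-tools meta /path/to/one/day/part-m-00000.parquet
parquet-tools schema /path/to/one/day/part-m-00000.parquet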

Question 4: Am I missing anything crazy here?







[A results table was included here but did not survive the plain-text
archive. What is recoverable: headers "Size of Input Parquet", "Number of
Rows", and "Query Status/Cluster Status/Size/Time"; the data has 103
columns, with all string columns run through CONVERT_FROM(field, 'UTF8') in
the view; per-run outcomes were "Array out of bounds", "OK", or "Did not
test"; and during the hung query the web server was unresponsive, sqlline
hung, there were no errors in the logs, and the query profile was gone. The
actual sizes, row counts, and times were lost.]

Error in the .out file on the drillbit that crashed and was restarted:

Jun 1, 2016 2:34:16 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 233,314B for [threat_name] BINARY: 635,320 values, 364,413B raw, 232,979B comp, 5 pages, encodings: [BIT_PACKED, PLAIN_DICTIONARY, RLE], dic { 1,481 entries, 76,680B raw, 1,481B comp}
Jun 1, 2016 2:34:16 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 296B for [parent_observation_event_id] INT64: 635,320 values, 71B raw, 81B comp, 5 pages, encodings: [BIT_PACKED, PLAIN_DICTIONARY,
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc3380520d0, pid=115847, tid=140474354837248
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
[thread 140474362205952 also had an error]
[thread 140474370627328 also had an error]
[thread 140474384312064 also had an error]
[thread 140474381154048 also had an error]
[thread 140474367469312 also had an error]
[thread 140474369574656 also had an error]
[thread 140474366416640 also had an error]
[thread 140474382206720 also had an error]
[thread 140474375890688 also had an error]
[thread 140474387470080 also had an error]
[thread 140474360100608 also had an error]
[thread 140474374838016 also had an error]
