drill-user mailing list archives

From Stone Zhong <stone.zh...@gmail.com>
Subject Query Parquet large parquet file, return empty columns
Date Wed, 08 Jan 2020 01:02:13 GMT

This is my first message to the user email list.

*Symptom*: I have a large parquet file (size = 1TB) and I run a query
against it. The query looks like this:

select * from `dfs`.`/root/data/hwd/machine_main2`

However, in both the Drill Web UI and the API, the returned column
'verifiedSerial' has empty values. That should be impossible, since the
query has an explicit condition that `verifiedSerial` cannot be empty.

*my configuration*:

   - I am running Apache Drill 1.17 (the latest version now) in embedded
   mode
   - The parquet file is generated from Apache Spark (3.x version). E.g., I
   use code like this to generate the parquet file:
   - The parquet file is about 1TB in size
   - My machine running drill has 32GB RAM, drill is pretty much the only
   app running on that machine

What is weird is that if I filter the parquet file in Spark (which results
in a smaller parquet file), everything works fine.

My questions are:

   1. Is the behavior I described expected?
   2. If (1) is not expected, how can I avoid this behavior? What kind of
   configuration change is needed? Or is this perhaps a bug?

Thanks in advance for any kind of help!

Stone Zhong
