drill-user mailing list archives

From Stone Zhong <stone.zh...@gmail.com>
Subject Query Parquet large parquet file, return empty columns
Date Wed, 08 Jan 2020 01:02:13 GMT
Hi,

This is my first message to the user email list.

*Symptom*: I have a large Parquet file (about 1 TB), and I run a query
like the following against it:

select * from `dfs`.`/root/data/hwd/machine_main2`
WHERE
    verifiedSerial='1547NM70EN'
LIMIT 20;

However, in both the Drill Web UI and the API, the returned
'verifiedSerial' column has an empty value. That should be impossible,
since the explicit condition in the query means `verifiedSerial` cannot
be empty.
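
For anyone who wants to reproduce the check through the API, here is a
minimal sketch. It assumes the Drill REST endpoint `/query.json` on the
default web port 8047 of a local embedded instance, and a response with a
`rows` list of per-row objects. The helper names (`build_request`,
`rows_with_empty_column`, `run_query`) are hypothetical, made up for this
illustration.

```python
import json
import urllib.request

# Assumption: Drill embedded mode with its web/REST port at the default 8047.
DRILL_URL = "http://localhost:8047/query.json"

QUERY = """
SELECT * FROM `dfs`.`/root/data/hwd/machine_main2`
WHERE verifiedSerial='1547NM70EN'
LIMIT 20
"""


def build_request(sql, url=DRILL_URL):
    """Build the POST request for a SQL query against the REST API."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_with_empty_column(result, column):
    """Return the rows whose value for `column` is missing or an empty string."""
    return [row for row in result.get("rows", []) if row.get(column) in (None, "")]


def run_query(sql, url=DRILL_URL):
    """Send the query and decode the JSON response (needs a running Drill)."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = run_query(QUERY)
    bad = rows_with_empty_column(result, "verifiedSerial")
    total = len(result.get("rows", []))
    print(f"{len(bad)} of {total} rows have an empty verifiedSerial")
```

Given the WHERE clause, any nonzero count of empty rows reproduces the
symptom described above.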

*My configuration*:

   - I am running Apache Drill 1.17 (currently the latest version) in
   embedded mode
   - The Parquet file is generated by Apache Spark (3.x version). E.g., I
   use code like this to generate the Parquet file:
   `df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
   - The Parquet file is about 1 TB in size
   - The machine running Drill has 32 GB of RAM, and Drill is pretty much
   the only app running on it


What is weird is that if I first filter the Parquet file in Spark (which
results in a smaller Parquet file), everything works fine.

My questions are:

   1. Is the behavior I described expected?
   2. If it is not expected, how can I avoid it? What kind of
   configuration change is needed? Or is this perhaps a bug?

Thanks in advance for any kind of help!

Regards,
Stone Zhong
