Hi,
This is my first message to the user email list.
*Symptom*: I have a large parquet file (about 1TB). I run the following
query against it:
select * from `dfs`.`/root/data/hwd/machine_main2`
WHERE
verifiedSerial='1547NM70EN'
LIMIT 20;
However, in both the Drill Web UI and the API, the returned
'verifiedSerial' column has an empty value. That should be impossible, since
the query explicitly requires `verifiedSerial` to equal '1547NM70EN'.
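For anyone who wants to reproduce the check outside the Web UI, here is a minimal sketch of the same test against the REST API. It assumes Drill's web server is on the default port 8047 on localhost and that the `/query.json` endpoint returns a JSON object with a `rows` list; the helper names (`build_request`, `rows_violating_filter`, `run_check`) are hypothetical and only for illustration.

```python
import json
import urllib.request

# Assumption: embedded Drill exposes its web server on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Same query as above, without the trailing semicolon.
QUERY = (
    "select * from `dfs`.`/root/data/hwd/machine_main2` "
    "WHERE verifiedSerial='1547NM70EN' "
    "LIMIT 20"
)


def build_request(sql, url=DRILL_URL):
    """Build a POST request carrying the SQL text as a JSON body."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_violating_filter(rows, column, expected):
    """Return the rows whose value for `column` differs from `expected`."""
    return [row for row in rows if row.get(column) != expected]


def run_check(sql=QUERY):
    """Run the query and return the rows that contradict the WHERE clause."""
    with urllib.request.urlopen(build_request(sql)) as resp:
        result = json.load(resp)
    return rows_violating_filter(
        result.get("rows", []), "verifiedSerial", "1547NM70EN"
    )
```

Calling `run_check()` against a running Drill instance should return an empty list if the filter is honored; any row it returns is one where the column does not match the condition.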
*My configuration*:
- I am running Apache Drill 1.17 (the latest version now) in embedded
mode
- The parquet file is generated by Apache Spark (3.x version). E.g., I
use code like this to generate the parquet file:
`df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
- The parquet file is about 1TB in size
- The machine running Drill has 32GB RAM, and Drill is pretty much the only
app running on that machine
What is weird is that if I filter the parquet file in Spark (which results
in a smaller parquet file), everything works fine.
My questions are:
1. Is the behavior I described expected?
2. If (1) is not expected, how can I avoid this behavior? What kind of
configuration change is needed? Or is this perhaps a bug?
Thanks in advance for any kind of help!
Regards,
Stone Zhong