drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Query Parquet large parquet file, return empty columns
Date Wed, 08 Jan 2020 02:48:27 GMT
Hi Stone,

Welcome to the Drill mailing list!

The situation you describe is not the expected behavior. Something is going wrong.

You mention that, in the large-file case, you do get results, but the values are empty. This
suggests that the query did, in fact, run (that is, it didn't run out of memory or encounter
some other internal problem). The question is: why were the columns empty?

And, when you say "empty", do you mean NULL (SQL NULL), an empty string, or perhaps a string
of spaces?
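One way to tell those three cases apart is a query along these lines (a sketch against the table and column from Stone's e-mail, untested against the actual data):

```sql
SELECT
  COUNT(*)                                                  AS total_rows,
  SUM(CASE WHEN verifiedSerial IS NULL THEN 1 ELSE 0 END)   AS null_rows,
  SUM(CASE WHEN verifiedSerial = ''   THEN 1 ELSE 0 END)    AS empty_rows,
  SUM(CASE WHEN verifiedSerial <> '' AND TRIM(verifiedSerial) = ''
      THEN 1 ELSE 0 END)                                    AS all_space_rows
FROM `dfs`.`/root/data/hwd/machine_main2`;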

You mention that you used both the Web console and the API. Do you mean the REST API? What
results do you get if you use the Sqlline command-line tool?


Your configuration seems fine. Your query is a simple scan and filter, so not much memory
should be needed; 32 GB should be fine. Are you using the default Drill memory options?

The odd thing is that you have two files: one small, one large. The small one works. Is the
schema the same for the two Parquet files?

Parquet files are divided into row groups. How large are the row groups in your file?


With only the data you provided, my hunch is that the problem is related to file size somehow,
but we will have to poke around to find where the problem lies.


Please do file a bug. It would be super helpful if you can include:
* The same description as in your e-mail.
* The actual output of the query (from Sqlline or the REST API).
* An example small Parquet file. Mostly we will need to see the schema so we can try to reproduce the problem.

Here is a possible workaround. Drill does not need the entire data set in one file: you can
create a directory of files. Try doing this with files of, say, 100MB or 256MB in size. A
side benefit is that, for real queries (not just LIMIT 10), having multiple files will let
Drill divide up the query to run scans in parallel so it can use all the CPUs on your machine.

Thanks,
- Paul

 

On Tuesday, January 7, 2020, 5:14:12 PM PST, Stone Zhong <stone.zhong@gmail.com> wrote:

Hi,

This is my first message to the user email list.

*Symptom*: I have a large Parquet file (about 1 TB in size). I run a query
against it like the one below:

select * from `dfs`.`/root/data/hwd/machine_main2`
WHERE
    verifiedSerial='1547NM70EN'
LIMIT 20;

However, in both the Drill Web UI and the API, the returned `verifiedSerial`
column has an empty value, which should be impossible, since the query's
explicit condition means `verifiedSerial` cannot be empty.

*my configuration*:

  - I am running Apache Drill 1.17 (the latest version now) in embedded
  mode
  - The Parquet file is generated by Apache Spark (version 3.x), e.g. with
  code like:
  `df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
  - The parquet file is about 1TB in size
  - My machine running drill has 32GB RAM, drill is pretty much the only
  app running on that machine


What is weird is, if I filter the Parquet file in Spark (which results in a
smaller Parquet file), everything works fine.

My questions are:

  1. Is the behavior I described expected?
  2. If not, how can I avoid this behavior? What kind of configuration change
  is needed, or is this perhaps a bug?

Thanks in advance for any kind of help!

Regards,
Stone Zhong
  