drill-user mailing list archives

From Kunal Khatua <kkha...@maprtech.com>
Subject Re: How to set output file size for parquet to csv conversions?
Date Thu, 17 Nov 2016 01:49:36 GMT

There isn't a way to limit the size of individual files. However, there might be a way you
can get around this. 

One option is to convert the Parquet files to CSV serially, one file at a time. Typically,
each fragment within Drill writes its own CSV file, so that might generate 12 CSV files for
each of the Parquet files.
If this doesn't work, you can use CTAS to generate smaller Parquet files from the existing
Parquet files and then retry the previous step.
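
As a rough illustration of the file-by-file approach, the sketch below only builds the Drill SQL text, one CTAS statement per Parquet file, so the statements can be pasted into sqlline or sent through whatever client is in use. The function name `ctas_statements`, the `/data/...` paths and the `_csv` table suffix are hypothetical; `dfs.tmp` is assumed to be a writable workspace, and `store.format` is the session option that selects the CTAS output format.

```python
from pathlib import PurePosixPath


def ctas_statements(parquet_paths, workspace="dfs.tmp"):
    """Build one CTAS statement per Parquet file (paths and names are illustrative)."""
    # Tell Drill to write CTAS output as CSV for this session.
    statements = ["ALTER SESSION SET `store.format` = 'csv';"]
    for path in parquet_paths:
        # Derive a table name from the file name, e.g. part-00.parquet -> part-00_csv
        table = PurePosixPath(path).stem + "_csv"
        statements.append(
            f"CREATE TABLE {workspace}.`{table}` AS SELECT * FROM dfs.`{path}`;"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical input paths; replace with the real Parquet file locations.
    for stmt in ctas_statements(["/data/part-00.parquet", "/data/part-01.parquet"]):
        print(stmt)
```

Running the statements one after another keeps each conversion limited to a single input file, which is what bounds the size of the CSV files it produces.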

However, the simplest approach is to use some utility to split the files to a reasonable size.
Linux has a utility called split. You should be able to find something similar on a Windows
desktop too if that is the OS for Drill.
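
If no split utility is at hand, a small script can do the same job. This is a minimal sketch, not a tested tool: the name `split_csv` and its parameters are made up for illustration, it assumes every record sits on a single line (no quoted fields containing newlines), and it assumes the first line is a header that should be repeated in each piece (pass `has_header=False` otherwise).

```python
import os


def split_csv(src, out_dir, max_bytes, has_header=True):
    """Split src into pieces of at most about max_bytes, cutting only at line ends."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src))[0]
    parts = []
    with open(src, "rb") as fin:
        # Keep the header so it can be copied to the top of every piece.
        header = fin.readline() if has_header else b""
        out = None
        written = 0
        for line in fin:
            # Start a new piece when the current one would exceed the limit.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{base}_{len(parts):04d}.csv")
                parts.append(path)
                out = open(path, "wb")
                out.write(header)
                written = len(header)
            out.write(line)
            written += len(line)
        if out is not None:
            out.close()
    return parts
```

For example, `split_csv("big.csv", "parts", 300 * 1024**2)` would write pieces of roughly 300MB each into a `parts` directory. A single line longer than the limit still ends up in a piece of its own, so that piece can exceed `max_bytes`.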

~ Kunal
On Tue 15-Nov-2016 8:44:23 AM, Mariano, Laura J. <lmariano@draper.com> wrote:
Hello all,

I have a large database of Parquet files that I need to convert to csv so they can be read
into Matlab. (Unless someone knows how to do this automatically without an intermediate step?).
I am using CTAS to create csv files from a directory of Parquet files, which works fine when
the total size of the Parquet files is smallish (~= 800MB). In this case, CTAS will automatically
generate a few csv files, ~200-300MB each, which are easily digestible by Matlab.

However, I have a dataset consisting of 34 Parquet files, totaling 18GB. When I run CTAS on
this to create the csv files using default parameters, it automatically generates 12 csv files,
more than 2GB each. This is approaching the upper limit of what Matlab can handle, so I am wondering
if there are any parameters I can set that will limit the size of the individual csv files
that are automatically created by the CTAS process. i.e. is there a parameter equivalent to
the store.parquet.block-size parameter for creating text files?

I am using Drill in embedded mode on a desktop.


Laura Mariano
Senior Member of the Technical Staff
(617) 258-2331