drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: DRILL 1.4 - newline in strings not supported
Date Mon, 01 Feb 2016 17:44:01 GMT
See inline.



On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <niparisco@gmail.com> wrote:

> ...
> @Ted,
>
> > If you have new lines in your files then the files becomes unsuitable for
> > splitting.  This means that the only parallelism available in a ctas
> > statement is multiple files
>
> Does it means newlines are incompatible with drill's distributed calculus
> ?
>

What it means is that the entire CSV file has to be read by a single
thread.  If you don't mind waiting while this happens, you would get the
same result, just more slowly because there is no parallelism.
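
To make the splitting problem concrete, here is a minimal sketch in plain
Python.  This is not Drill code; the sample data and variable names are made
up for illustration.  It compares one sequential reader with a reader that
starts at a line boundary in the middle of the file, which is roughly what a
split would do.

```python
import csv
import io

# Hypothetical sample: the record with id 1 has a newline inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# A single sequential reader sees the opening quote, so it keeps the
# two physical lines together as one logical record.
rows = list(csv.reader(io.StringIO(data)))

# A reader that begins at an arbitrary line boundary has no way to know
# it is starting inside a quoted field, so it misreads the record.
lines = data.splitlines(keepends=True)
tail = "".join(lines[2:])  # begins in the middle of record 1
broken = list(csv.reader(io.StringIO(tail)))

print(rows)
print(broken)
```

The second reader returns a fragment of record 1 as if it were a record of
its own, which is why a file like this has to be read from the start by one
thread.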


> Do you have a fair number of files?
> I have one 30GB csv file. I don't know how many parquet file it could
> create as process crashes because of newlines.
> I can imagine approx 5 parquet files 500 MB.
>

That is reasonably small.

But as Jacques says later in the thread, this is future work.
