drill-user mailing list archives

From "Peder Jakobsen | gmail" <pjakob...@gmail.com>
Subject Re: Creating a single parquet or csv file using CTAS command?
Date Thu, 04 Feb 2016 18:04:53 GMT
On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Is there a reason to create a single file? Typically you may want more
> files to improve parallel operation on distributed systems like drill.

Good question. I'm not actually using Drill for "big data". In fact, I
never deal with "big data", and I'm unlikely to ever do so.

But I do have 500 GB of CSV files spread across about 100 directories.
They are all part of the same dataset, but this is how it's been organized
by the government department that released it as an Open Data dump.

Drill saves me the hassle of having to stitch these files together using
Python or awk. I love being able to just query the files using SQL. So far
it's slow, though: 18 seconds for a simple query is too much, and I need to
figure out why. The data eventually needs to end up on the web so I can
share it with other people, and I use crossfilter.js and D3.js for
presentation. I need fine-grained control over online data presentation,
and all the BI tools I've seen, e.g. Tableau, are terrible in this
department.

So I need my data in a format that can be read by common web frameworks,
and that usually implies dealing with a single file that can be uploaded to
the web server.  No need for a database, since I'm just reading a few
columns from a big flat file.
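
A minimal sketch of the kind of manual stitching described above, which
produces one flat file holding only a few columns. It assumes every CSV
file has a header row and that all files share the same column names; the
function name merge_csv_columns and its parameters are hypothetical, not
part of Drill:

```python
import csv
from pathlib import Path


def merge_csv_columns(root, out_path, columns):
    """Stitch every *.csv under root into one file, keeping only `columns`.

    Assumes each input file starts with a header row (hypothetical layout).
    Returns the number of data rows written.
    """
    out_file = Path(out_path).resolve()
    rows_written = 0
    with open(out_file, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns)
        writer.writeheader()
        # sorted() keeps the output order stable across runs
        for path in sorted(Path(root).rglob("*.csv")):
            if path.resolve() == out_file:
                continue  # skip the output file if it lives under root
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    # a column missing from a file is written as empty
                    writer.writerow({c: row.get(c, "") for c in columns})
                    rows_written += 1
    return rows_written
```

Rows are streamed one at a time, so memory use stays flat regardless of
how large the input is.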

I run my apps on a low-cost virtual server. I don't have access to
Java/VirtualBox/MongoDB etc., nor do I think these things are necessary.

So this use case may be quite different from those of many of the more
"corporate" users, but Drill is so very useful regardless.
