drill-user mailing list archives

From "Peder Jakobsen | gmail" <pjakob...@gmail.com>
Subject Re: Creating a single parquet or csv file using CTAS command?
Date Thu, 04 Feb 2016 18:57:27 GMT
Hi Andries, the trouble is that I run Drill on my desktop machine, but I
have no server available to me that is capable of running Drill. Most
$10/month hosting accounts do not permit you to run Java apps. For this
reason I simply use Drill for "pre-processing" of the files that I will
eventually use in my simple 50-line Python web app.

Even if I could run Drill on my server, this seems like a lot of overhead
for something as simple as a flat file with 6 columns.
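
For what it's worth, the "pre-processing" step I have in mind is roughly the
following (an untested sketch - the option names come from the Drill docs,
and the paths and column names are just placeholders for my real ones):

    -- write CSV instead of parquet, since my web app reads a flat file
    ALTER SESSION SET `store.format` = 'csv';
    -- keep CTAS on a single thread so it writes as few output files as possible
    ALTER SESSION SET `planner.width.max_per_node` = 1;

    -- placeholder path and column names; the real dataset has 6 columns
    CREATE TABLE dfs.tmp.`flat_file` AS
    SELECT columns[0] AS region,
           columns[1] AS `year`,
           columns[2] AS amount
    FROM dfs.`/data/opendata/csv`;

If I understand the docs correctly, capping the width at 1 should leave a
single output file (0_0_0.csv or similar) under the tmp workspace, which is
all I need to upload to the web server.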

On Thu, Feb 4, 2016 at 1:31 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> You can create multiple parquet files and have the ability to query them
> all through the Drill SQL interface with minimal overhead.
>
> Creating a single 50GB parquet file is likely not the best option for
> performance; perhaps use Drill partitioning for the parquet files to speed
> up queries and reads in the future, although parquet should be more
> efficient than CSV for storing data. You can still limit Drill to a single
> thread to limit memory use for parquet CTAS, and potentially the number of
> files created.
>
> A bit of experimentation may help to find the optimum config for your use
> case.
>
> --Andries
>
> > On Feb 4, 2016, at 10:12 AM, Peder Jakobsen | gmail <pjakobsen@gmail.com>
> > wrote:
> >
> > Sorry, bad typo:  I have 50GB of data, NOT 500GB  ;).  And I usually only
> > query a 1 GB subset of this data using Drill.
> >
> >
> >
> > On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail <
> > pjakobsen@gmail.com> wrote:
> >
> >> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <
> >> aengelbrecht@maprtech.com> wrote:
> >>
> >>> Is there a reason to create a single file? Typically you may want more
> >>> files to improve parallel operation on distributed systems like Drill.
> >>>
> >>
> >> Good question. I'm not actually using Drill for "big data". In fact, I
> >> never deal with "big data", and I'm unlikely to ever do so.
> >>
> >> But I do have 500 GB of CSV files spread across about 100 directories.
> >> They are all part of the same dataset, but this is how it's been
> >> organized by the government department that released it as an Open Data
> >> dump.
> >>
> >> Drill saves me the hassle of having to stitch these files together using
> >> Python or awk. I love being able to just query the files using SQL (so
> >> far it's slow, though - 18 seconds for a simple query is too much, and I
> >> need to figure out why). Data eventually needs to end up on the web to
> >> share it with other people, and I use crossfilter.js and D3.js for
> >> presentation. I need fine-grained control over online data presentation,
> >> and all the BI tools I've seen, e.g. Tableau, are terrible in this
> >> department.
> >>
> >> So I need my data in a format that can be read by common web frameworks,
> >> and that usually implies dealing with a single file that can be uploaded
> >> to the web server. No need for a database, since I'm just reading a few
> >> columns from a big flat file.
> >>
> >> I run my apps on a low-cost virtual server. I don't have access to
> >> Java/VirtualBox/MongoDB etc., nor do I think these things are necessary:
> >> K.I.S.S.
> >>
> >> So this use case may be quite different from that of many of the more
> >> "corporate" users, but Drill is so very useful regardless.
> >>
> >>
> >>
> >>
>
>

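P.S. If I ever do move to parquet, the partitioned CTAS you describe would,
as far as I can tell from the docs, look something like this (again
untested, and the path and column names are placeholders):

    ALTER SESSION SET `store.format` = 'parquet';

    -- the PARTITION BY column has to appear in the SELECT list
    CREATE TABLE dfs.tmp.`dataset_parquet`
    PARTITION BY (`year`)
    AS SELECT columns[0] AS region,
              columns[1] AS `year`,
              columns[2] AS amount
       FROM dfs.`/data/opendata/csv`;

Queries that filter on the partition column should then be able to prune
the parquet files that don't contain matching values, which I assume is the
speed-up you're referring to.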