drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Creating a single parquet or csv file using CTAS command?
Date Thu, 04 Feb 2016 19:15:26 GMT
On a desktop you will likely be limited by memory.

Perhaps set the width to 1 to force single-threaded execution, and use 512MB or 1GB for the
parquet block size, depending on how much direct memory the Drillbit has. This will limit the
number of parquet files being created. See how much smaller the parquet file(s) are compared
to CSV on a subset of the data. Will there be an issue with multiple parquet files in a directory?
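
As a rough sketch of the settings described above, the Python snippet below only builds the
Drill statements as strings; it does not connect to Drill. It assumes the "width" is the session
option `planner.width.max_per_node` and that the block size is `store.parquet.block-size` (in
bytes). The table name and source path are hypothetical placeholders, so adjust them to your
own workspace.

```python
# Sketch only: build the Drill SQL for a single-threaded parquet CTAS.
# The option names are Drill session options; the table and path are made up.
MB = 1024 * 1024


def ctas_statements(block_size_mb=512, width=1,
                    table="dfs.tmp.`subset_parquet`",   # hypothetical target table
                    source="dfs.`/data/csv/subset`"):   # hypothetical CSV directory
    """Return the statements to run, in order, in one Drill session."""
    block_bytes = block_size_mb * MB  # Drill expects the block size in bytes
    return [
        "ALTER SESSION SET `planner.width.max_per_node` = %d" % width,
        "ALTER SESSION SET `store.parquet.block-size` = %d" % block_bytes,
        "ALTER SESSION SET `store.format` = 'parquet'",
        "CREATE TABLE %s AS SELECT * FROM %s" % (table, source),
    ]


if __name__ == "__main__":
    for stmt in ctas_statements():
        print(stmt + ";")
```

Try `block_size_mb=512` first and move to 1024 only if the Drillbit has direct memory to spare.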

Before Drill can write the parquet file it needs to build the structure in memory, so it needs
space for both the input data and the parquet file in memory. A desktop is unlikely to have
enough memory for a single parquet file of that size. Experiment with a subset of your data
and see what works.
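
To judge the subset experiment, something like the following Python sketch could compare the
size on disk of the CSV input and the parquet output. The helper names are my own and the two
directory paths are whatever you used for the test; nothing here is Drill-specific.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all files under path, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def size_ratio(csv_dir, parquet_dir):
    """Parquet size divided by CSV size; values below 1.0 mean parquet is smaller."""
    csv_bytes = dir_size_bytes(csv_dir)
    if csv_bytes == 0:
        raise ValueError("no CSV data found under %s" % csv_dir)
    return dir_size_bytes(parquet_dir) / csv_bytes
```

Multiplying the ratio by the full 50GB gives a rough estimate of the total parquet size,
assuming the subset is representative of the rest of the data.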

Have you looked at cloud solutions to process the data, apart from just a basic hosting solution?
As an example, you will find MapR with Drill available on demand on Amazon AWS and Azure. You
could spin up one or more nodes, load/process/download your data, and then spin them down
again. Might be worth a look, depending on your needs/budget.

--Andries


> On Feb 4, 2016, at 10:57 AM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
> 
> Hi Andries,   the trouble is that I run Drill on my desktop machine, but I
> have no server available to me that is capable of running Drill.  Most
> $10/month hosting accounts do not permit you to run java apps.  For this
> reason I simply use Drill for "pre-processing" of the files that I
> eventually will use in my simple 50 line python web app.
> 
> Even if I could run drill on my server, this seems like a lot of overhead
> for something as simple as a flat file  with 6 columns.
> 
> On Thu, Feb 4, 2016 at 1:31 PM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> You can create multiple parquet files and have the ability to query them
>> all through the Drill SQL interface with minimal overhead.
>> 
>> Creating a single 50GB parquet file is likely not the best option for
>> performance; perhaps use Drill partitioning for the parquet files to speed
>> up queries and reads in the future. That said, parquet should be more
>> efficient than CSV for storing data. You can still limit Drill to a single
>> thread to limit memory use for the parquet CTAS and potentially the number
>> of files created.
>> 
>> A bit of experimentation may help to find the optimum config for your use
>> case.
>> 
>> --Andries
>> 
>>> On Feb 4, 2016, at 10:12 AM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
>>> 
>>> Sorry, bad typo:  I have 50GB of data, NOT 500GB  ;).  And I usually only
>>> query a 1 GB subset of this data using Drill.
>>> 
>>> 
>>> 
>>> On Thu, Feb 4, 2016 at 1:04 PM, Peder Jakobsen | gmail <pjakobsen@gmail.com> wrote:
>>> 
>>>> On Thu, Feb 4, 2016 at 11:15 AM, Andries Engelbrecht <
>>>> aengelbrecht@maprtech.com> wrote:
>>>> 
>>>>> Is there a reason to create a single file? Typically you may want more
>>>>> files to improve parallel operation on distributed systems like drill.
>>>>> 
>>>> 
>>>> Good question.   I'm not actually using Drill for "big data".  In fact, I
>>>> never deal with "big data", and I'm unlikely to ever  do so.
>>>> 
>>>> But I do have 500 GB of CSV files spread across about 100 directories.
>>>> They are all part of the same dataset, but this is how it's been organized
>>>> by the government department who has released it as an Open Data dump.
>>>> Drill saves me the hassle of having to stitch these files together using
>>>> python or awk. I love being able to just query the files using SQL (so far
>>>> it's slow though, I need to figure out why - 18 seconds for a simple query
>>>> is too much).   Data eventually needs to end up on the web to share it with
>>>> other people, and I use crossfilter.js and D3.js for presentation.  I need
>>>> fine grained control over online data presentation, and all BI tools I've
>>>> seen are terrible in this department, eg. Tableau.
>>>> 
>>>> So I need my data in a format that can be read by common web frameworks,
>>>> and that usually implies dealing with a single file that can be uploaded to
>>>> the web server.  No need for a database, since I'm just reading a few
>>>> columns from a big flat file.
>>>> 
>>>> I run my apps on a low cost virtual server. I don't have access to
>>>> java/virtualbox/MongoDB etc.  Nor do I think these things are necessary:
>>>> K.I.S.S
>>>> 
>>>> So this use case may be quite different from many of the more "corporate"
>>>> users, but Drill is so very useful regardless.
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 

