drill-user mailing list archives

From Kunal Khatua <kkha...@mapr.com>
Subject RE: Drill performance question
Date Mon, 30 Oct 2017 17:46:15 GMT
I second Ted's suggestion!

Since we haven't seen your profile's operator overview, we can't say for sure why the
performance isn't good. 

Off the top of my head, these are the most likely things making your performance so
bad:

1. All the CSV files are being read and rows rejected, because there is no way for Drill to
know which segments of data contain the relevant time ranges you are looking
at. 
2. Your CSV data has many columns, but you only care about a few... CSV readers will need
to process the irrelevant ones too. 
3. There is a cost to reading and casting/converting the data into a date/time format. 

So, as Ted suggested, writing the data out as a Parquet file will give you the most bang for the buck.

Partitioning on, say, a date helps, but you also don't want it to be too granular.

Last but not least, if you are doing a query of the form
select X,Y,Z where time between <startTime> and <endTime>
you will benefit immensely from the data being sorted on that time field. 

Hope that helps. 

~ Kunal

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, October 30, 2017 9:34 AM
To: user <user@drill.apache.org>
Subject: Re: Drill performance question

Also, on a practical note, Parquet will likely crush CSV on performance.
Columnar. Compressed. Binary.  All that.



On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra <saurabhmahapatra94@gmail.com> wrote:

> Hi Charles,
>
> Can you share some query patterns on this data? More specifically, the 
> number of columns you are retrieving out of the total, and the filter on the 
> time dimension itself (ranges and granularities).
>
> How much is ad hoc and how much is not.
>
> Best,
> Saurabh
>
> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgivre@gmail.com> wrote:
>
> > Hello all,
> > I have a dataset consisting of about 16 GB of CSV files.  I am 
> > looking to do some time series analysis of this data, and created a 
> > view but when I started doing aggregate queries using components of 
> > the date, the performance was disappointing.  Would it be better to 
> > do a CTAS and partition by components of the date?  If so, would 
> > parquet be the best format?
> > Would anyone have other suggestions of things I could do to improve 
> > performance?
> > Thanks,
> > — C
>