drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Charles Givre <cgi...@gmail.com>
Subject Re: Drill performance question
Date Tue, 31 Oct 2017 18:51:20 GMT
Thanks everyone!
Converting to partitioned parquet really did the trick.   I ended up partitioning by day and
hour, and while the queries aren’t instantaneous, they aren’t slow either. 

> On Oct 30, 2017, at 14:21, Andries Engelbrecht <aengelbrecht@mapr.com> wrote:
> As Ted touched on - CSV requires a lot of text parsing and data type conversion, while
parquet doesn’t. This typically saves a ton of CPU when you query, even if you can’t leverage
partitioning or columnar advantages.
> --Andries
> On 10/30/17, 10:57 AM, "Saurabh Mahapatra" <saurabhmahapatra94@gmail.com> wrote:
>    As Kunal points out, Parquet and partitioning should help. The concept of
>    partitioning is not new to warehousing type of queries. To get performance,
>    you aggregate and partition by the granularity of the query you are
>    seeking. For example, by day, by week, by month or even quarters. The goal
>    is to avoid on the fly aggregation for speed.
>    It is not always true that a columnar format will outperform a row-based
>    format in all situations. If you are doing point-access based analytics
>    where the selectivity of your filter is very very high and you want access
>    to all columns, row-based formats may actually do better.
>    Data layout and query characteristics are the yin-yang of performance for
>    any query engine on massive scale data. They need to match for the problem
>    you are solving.
>    Best,
>    Saurabh
>    On Mon, Oct 30, 2017 at 10:46 AM, Kunal Khatua <kkhatua@mapr.com> wrote:
>> I second Ted's suggestion!
>> Since we haven't seen what your profile's operator overview, we can't say
>> for sure why the performance isn't good.
>> On the top of my head ,these are most likely things happening that make
>> your performance so bad:
>> 1. All the CSV files are being read and rows rejected because there is no
>> way for Drill to understand the which segments of data have the relevant
>> time ranges that you might be looking at.
>> 2. Your CSV data has many columns, but you only care about a few... CSV
>> readers will need to process the irrelevant ones too.
>> 3. There is a cost to reading and casting/converting the data into a
>> date/time format.
>> So, as Ted suggested, writing as a parquet file will give you the most
>> bang for the buck.
>> Partitioning on, say, a date helps.. but you also don't want it too
>> granular.
>> Last but not the least, if you are doing a query of the form..
>> select X,Y,Z where time between <startTime> and <endTime>
>> you will benefit immensely from the data being sorted with that time field.
>> Hope that helps.
>> ~ Kunal
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>> Sent: Monday, October 30, 2017 9:34 AM
>> To: user <user@drill.apache.org>
>> Subject: Re: Drill performance question
>> Also, on a practical note, Parquet will likely crush CSV on performance.
>> Columnar. Compressed. Binary.  All that.
>> On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra <
>> saurabhmahapatra94@gmail.com> wrote:
>>> Hi Charles,
>>> Can you share some query patterns on this data? More specifically, the
>>> number of columns you retrieving out of the total, the filter on the
>>> time dimension itself (ranges and granularities)
>>> How much is ad hoc and how much is not.
>>> Best,
>>> Saurabh
>>> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgivre@gmail.com> wrote:
>>>> Hello all,
>>>> I have a dataset consisting of about 16 GB of CSV files.  I am
>>>> looking to do some time series analysis of this data, and created a
>>>> view but when I started doing aggregate queries using components of
>>>> the date, the performance was disappointing.  Would it be better to
>>>> do a CTAS and partition by components of the date?  If so, would
>>>> parquet be the best format?
>>>> Would anyone have other suggestions of things I could do to improve
>>>> performance?
>>>> Thanks,
>>>> — C

View raw message