drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Charles Givre <cgi...@gmail.com>
Subject Re: Drill performance question
Date Mon, 30 Oct 2017 16:36:00 GMT
The data itself contains 6 or so columns:  date, user_id, city, state, lat, long.   Iā€™m looking
to aggregate by week, by day of week etc.  So the general pattern would look something like:

SELECT EXTRACT( day FROM `date` ) AS _`day`, 
COUNT( DISTINCT id ) as distinct_id, 
COUNT( id ) as total_id
FROM <data>
GROUP BY EXTRACT( day FROM `date` )
ORDER BY `day ASC

The view I created cast all the fields to the appropriate data types. 
ā€” C

> On Oct 30, 2017, at 12:30, Saurabh Mahapatra <saurabhmahapatra94@gmail.com> wrote:
> 
> Hi Charles,
> 
> Can you share some query patterns on this data? More specifically, the
> number of columns you retrieving out of the total, the filter on the time
> dimension itself (ranges and granularities)
> 
> How much is ad hoc and how much is not.
> 
> Best,
> Saurabh
> 
> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgivre@gmail.com> wrote:
> 
>> Hello all,
>> I have a dataset consisting of about 16 GB of CSV files.  I am looking to
>> do some time series analysis of this data, and created a view but when I
>> started doing aggregate queries using components of the date, the
>> performance was disappointing.  Would it be better to do a CTAS and
>> partition by components of the date?  If so, would parquet be the best
>> format?
>> Would anyone have other suggestions of things I could do to improve
>> performance?
>> Thanks,
>> ā€” C


Mime
View raw message