drill-user mailing list archives

From PROJJWAL SAHA <proj.s...@gmail.com>
Subject Re: Query on performance using Drill and Amazon s3.
Date Wed, 22 Feb 2017 06:32:17 GMT
Thanks Nitin for the metrics you provided and the suggestions.
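
For a rough sense of scale, the figures quoted in this thread can be turned into average throughput numbers. This is only a back-of-the-envelope sketch: it assumes 1 GB = 1024 MB (the thread does not say which unit was meant), and the two measurements come from different clusters and queries, so they are not directly comparable.

```python
# Back-of-the-envelope throughput from the figures quoted in this thread.
# Assumption: 1 GB is treated as 1024 MB.
MB_PER_GB = 1024


def effective_mb_per_s(size_gb: float, seconds: float) -> float:
    """Average end-to-end rate in MB/s for processing size_gb in the given time."""
    return size_gb * MB_PER_GB / seconds


# TSV on S3: 1 GB, select * took 23 minutes (figure from the original question).
tsv_rate = effective_mb_per_s(1, 23 * 60)

# Parquet test run: 8 GB in roughly 30 seconds (figure from Nitin's reply).
parquet_rate = effective_mb_per_s(8, 30)

print(f"TSV select *: {tsv_rate:.2f} MB/s")
print(f"Parquet run:  {parquet_rate:.1f} MB/s")
```

Under these assumptions the TSV query averages well under 1 MB/s end to end, which is consistent with the suggestion below that network transfer or result display, rather than query execution, dominates the 23 minutes.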

On Tue, Feb 21, 2017 at 2:23 PM, Nitin Pawar <nitinpawar432@gmail.com>
wrote:

> Instead of doing select * on the first go,
> can you run a query like select count(1)?
>
> When your data is in CSV files then yes, all the data is transferred to the
> Drill node and then the query is executed on top of it.
> We had noticed that query times on CSV were significantly higher than on
> Parquet files, so we moved our data from CSV to Parquet and have not seen
> any issues since then.
>
> We did a test run on 125M records; the size was 8 GB in Parquet and it took
> roughly 30 seconds or so.
>
> I would suggest two things:
> 1) Which AWS region is your S3 bucket hosted in, and which region are your
> EC2 servers hosted in?
> 2) If the answer to the above question is two different regions, then you
> might want to move them into a single region.
>
> In either case, from the AWS console you can figure out how much network
> throughput you are getting, if that is the bottleneck.
> Also, Drill machines need CPU, so along with 32 GB memory, having 8
> cores would be desirable.
>
> On Tue, Feb 21, 2017 at 2:17 PM, PROJJWAL SAHA <proj.saha@gmail.com>
> wrote:
>
> > Hi Nitin,
> >
> > I am executing the SQL query on a drillbit node using drill-conf.
> > We have configured a 5-node Drill cluster external to Amazon with 32 GB
> > RAM. From one of the nodes, we are using the drill-conf utility to fire the
> > SQL query.
> >
> > One observation I had is:
> > select * from `xxx.tsv`
> > select * from `xxx.tsv` where yyy = 'zzz'
> >
> > Both these queries take almost the same time for 1 GB of data with
> > 1000000 rows. So if the network data transfer is the major time-taking
> > component compared with the query execution time, I think that the entire
> > data is first transferred to the Drill cluster and then the query is
> > executed on the Drill cluster?
> >
> > Regards,
> > Projjwal
> >
> > On Mon, Feb 20, 2017 at 6:18 PM, Nitin Pawar <nitinpawar432@gmail.com>
> > wrote:
> >
> > > How are you doing the select *: using the Drill UI or sqlline?
> > > Where are you running it from?
> > > Is Drill hosted in AWS or on your local machine?
> > >
> > > I think the majority of the time is spent on displaying the result set
> > > instead of querying the file, if the Drill server is on AWS.
> > > If the Drill server is local, then it might be your network that takes
> > > a lot of time, depending on the S3 bucket location and where your Drill
> > > server is.
> > >
> > > On Mon, Feb 20, 2017 at 5:37 PM, PROJJWAL SAHA <proj.saha@gmail.com>
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am using Drill 1.8 with 1 GB of data in the form of a .tsv file
> > > > stored in Amazon S3. I am using the default configuration of Drill with
> > > > the S3 storage plugin that comes out of the box. The drillbits are
> > > > configured on a 5-node cluster with 32 GB RAM and 4 vCPUs.
> > > >
> > > > I see that a select * from xxx; query takes 23 mins to fetch
> > > > 1,040,000 rows.
> > > >
> > > > Is this the expected behaviour?
> > > > I am looking for any quick tuning that can improve the performance, or
> > > > any other suggestions.
> > > >
> > > > Attached is the JSON profile for this query.
> > > >
> > > > Regards,
> > > > Projjwal
> > > >
> > >
> > >
> > >
> > > --
> > > Nitin Pawar
> > >
> >
>
>
>
> --
> Nitin Pawar
>
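
Nitin's suggestion above to try select count(1) before select * could be checked by timing both queries from a script instead of from the console, so that terminal rendering is taken out of the picture. The following is a minimal sketch, not a tested setup: it assumes the drillbit's web port is reachable without authentication, `DRILL_URL` is a hypothetical address, and `s3` is an assumed storage plugin name. `xxx.tsv` is the placeholder file name used earlier in the thread.

```python
import json
import time
import urllib.request

# Hypothetical address of one drillbit's REST endpoint (8047 is Drill's default web port).
DRILL_URL = "http://localhost:8047/query.json"

# `s3` is an assumed storage plugin name; `xxx.tsv` is the placeholder from the thread.
QUERIES = [
    "select count(1) from s3.`xxx.tsv`",
    "select * from s3.`xxx.tsv`",
]


def build_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build a POST request that submits one SQL statement to Drill's REST API."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_query(sql: str, url: str = DRILL_URL) -> float:
    """Run the query and return wall-clock seconds (needs a reachable drillbit)."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        resp.read()  # drain the whole response so result transfer is included
    return time.perf_counter() - start
```

Calling `time_query(q)` for each entry in `QUERIES` gives two wall-clock numbers to compare. If count(1) is also slow, reading the file from S3 is the likely bottleneck; if only select * is slow, returning the result set to the client is the more likely cost. Note that the REST endpoint returns the full result as JSON, so the select * timing includes that serialization.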
