drill-user mailing list archives

From Shankar Mane <shankar.m...@games24x7.com>
Subject Re: [Drill-Questions] Speed difference between GZ and BZ2
Date Mon, 01 Aug 2016 10:16:37 GMT
It is plain JSON (one JSON object per line).
Size of each JSON message = ~4 KB
Number of JSON messages = ~5 million

store.parquet.compression = snappy (I don't think this parameter is
used, since I am only running SELECT queries.)
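
One way to check whether the gap comes from the codecs themselves rather than from Drill is to time raw decompression outside the query engine. bzip2 is generally more CPU-intensive to decompress than gzip, so some gap is expected. The sketch below uses only the Python standard library and a small synthetic payload; the record fields, the filler string, and the record count are assumptions for illustration, not the actual Kafka messages, and the result says nothing about how Drill schedules the read.

```python
import bz2
import gzip
import json
import time


def make_json_lines(n_records: int) -> bytes:
    """Build a synthetic newline-delimited JSON payload (hypothetical fields)."""
    lines = []
    for i in range(n_records):
        record = {
            "channelid": i % 4,
            "serverTime": 1469404800000 + i,  # made-up timestamp values
            "payload": "x" * 200,  # filler, not representative of real messages
        }
        lines.append(json.dumps(record))
    return ("\n".join(lines) + "\n").encode("utf-8")


def time_decompress(decompress, blob: bytes):
    """Return the decompressed bytes and the wall-clock seconds taken."""
    start = time.perf_counter()
    data = decompress(blob)
    return data, time.perf_counter() - start


raw = make_json_lines(20_000)
gz_blob = gzip.compress(raw)
bz2_blob = bz2.compress(raw)

gz_data, gz_seconds = time_decompress(gzip.decompress, gz_blob)
bz2_data, bz2_seconds = time_decompress(bz2.decompress, bz2_blob)

print(f"gzip: {len(gz_blob)} bytes, decompressed in {gz_seconds:.4f} s")
print(f"bz2:  {len(bz2_blob)} bytes, decompressed in {bz2_seconds:.4f} s")
```

Running the same timing against a sample of the real files would give a rough baseline for how much of the query time is pure decompression.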


On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <kfaraaz@maprtech.com> wrote:

> What is the data format within those .gz and .bz2 files? Is it Parquet,
> JSON, or plain text (CSV)?
> Also, what was the config parameter `store.parquet.compression` set to
> when you ran your test?
>
> - Khurram
>
> On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <shankar.mane@games24x7.com> wrote:
>
> > Awaiting a response.
> >
> > On 30-Jul-2016 3:20 PM, "Shankar Mane" <shankar.mane@games24x7.com> wrote:
> >
> > >
> > > I am comparing query speed between GZ and BZ2.
> > >
> > > Below are the two files and their sizes (both files contain the same data):
> > > kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
> > > kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
> > >
> > >
> > >
> > > Results:
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (86.034 seconds)
> > >
> > >
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from
> > > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid;
> > > +------------+----------+
> > > | channelid  |  EXPR$1  |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (459.079 seconds)
> > >
> > >
> > >
> > > Questions:
> > > 1. In the test above, GZ is roughly 5x faster than BZ2 (86 s vs. 459 s). Why is that?
> > > 2. How can we speed up BZ2? Is there any configuration to tune?
> > > 3. Since bz2 is a splittable format, how does Drill make use of it?
> > >
> > >
> > > regards,
> > > shankar
> >
>
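
For reference, the timings quoted in the thread can be turned into a slowdown factor and an approximate row throughput. This is plain arithmetic on the numbers reported above; the variable names are illustrative, and the row total counts only rows with a non-null serverTime, since that is what count(serverTime) returns.

```python
# Elapsed times reported by sqlline in the quoted results, in seconds.
timings_seconds = {"gz": 86.034, "bz2": 459.079}

# Per-channel counts from the quoted result tables (identical for both files).
counts_by_channel = {3: 977134, 0: 836850, 2: 3202854}

# Rows with a non-null serverTime across all channels.
total_rows = sum(counts_by_channel.values())

# How many times longer the bz2 query took than the gz query.
slowdown = timings_seconds["bz2"] / timings_seconds["gz"]

# Approximate rows processed per second for each file.
rows_per_second = {
    codec: total_rows / seconds for codec, seconds in timings_seconds.items()
}

print(f"total rows counted: {total_rows}")
print(f"bz2 / gz elapsed time: {slowdown:.2f}x")
for codec, rate in rows_per_second.items():
    print(f"{codec}: {rate:,.0f} rows/s")
```

The ratio comes out a little above 5x, which is somewhat less than the 6x estimate in the original question.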
