drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Which perform better JSON or convert JSON to parquet format ?
Date Mon, 11 Jun 2018 08:49:04 GMT
I am going to play the contrarian here.

Parquet is not *always* faster than JSON.

The (almost unique) case where it is better to leave data as JSON (or
whatever) is when the average number of times that a file is read is roughly
one or less.

The point is that to read the files n times in Parquet format, you have to
read the JSON once, write the Parquet, and then read the Parquet n times.
The cost of reading the JSON n times is simply n times the cost of reading
the JSON (neglecting caches and such). As such, if n <= 1 + epsilon, JSON
wins.
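That break-even argument can be sketched numerically. This is a minimal illustration of the cost model only; the unit costs below are made-up numbers, not measurements from any real workload:

```python
def total_cost(n, json_read, parquet_write, parquet_read):
    """I/O cost of serving n reads from each format.

    Returns (cost if kept as JSON, cost if converted to Parquet).
    Converting requires one JSON read plus one Parquet write up front.
    """
    cost_json = n * json_read
    cost_parquet = json_read + parquet_write + n * parquet_read
    return cost_json, cost_parquet

# Hypothetical unit costs: a Parquet read 5x cheaper than a JSON read.
JSON_READ, PARQUET_WRITE, PARQUET_READ = 1.0, 0.5, 0.2

cost_json, cost_parquet = total_cost(1, JSON_READ, PARQUET_WRITE, PARQUET_READ)
assert cost_json < cost_parquet   # one read: leaving it as JSON wins

cost_json, cost_parquet = total_cost(10, JSON_READ, PARQUET_WRITE, PARQUET_READ)
assert cost_json > cost_parquet   # ten reads: the conversion pays for itself
```

With these (made-up) costs the crossover sits just above n = 1, which is the "n <= 1 + epsilon" point above; the cheaper Parquet reads are, the smaller that epsilon gets.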

This isn't as strange a case as it might seem. For security logs, it is
common that the files are never read until you need them. That means that n
is nearly zero on average and n << 1 in any case. For incoming data, it is
common that there is an immediate transformation into an alternative form.
That might be pruning, elaborating, or aggregating the data. The point is
that the original data never needs to be written into Parquet format since
it is only ever read once. Transforming the format would waste time and
space.

The other case of importance is where the read time is near zero for JSON.
Transforming to any other format will take near-zero time, and reading from
the other format will also take near-zero time, so the win from transforming
will be near zero as well.


Having said all that, I agree that reading from Parquet will almost
certainly be faster, and combining a bunch of small JSON files into a
larger Parquet file will be a real boon for frequently read data. It's just
that faster isn't always better when there is a fixed cost.

On Mon, Jun 11, 2018 at 6:42 AM Padma Penumarthy <ppenumarthy@mapr.com> wrote:

> Yes, parquet is always better for multiple reasons. With JSON, we have to
> read the whole file from a single reader thread and have to parse it to
> read individual columns. Parquet compresses and encodes data on disk, so
> we read much less data from disk. Drill can read individual columns within
> each rowgroup in parallel. Also, we can leverage features like filter
> pushdown, partition pruning, and the metadata cache for better query
> performance.
>
> Thanks,
> Padma
> > On Jun 10, 2018, at 8:22 PM, Abhishek Girish <agirish@apache.org> wrote:
> >
> > I would suggest converting the JSON files to parquet for better
> > performance. JSON supports a more free-form data model, so that's a
> > trade-off you need to consider, in my opinion.
> > On Sun, Jun 10, 2018 at 8:08 PM Divya Gehlot <divya.htconex@gmail.com>
> > wrote:
> >
> >> Hi,
> >> I am looking for advice regarding the performance of the options below:
> >> 1. Keep the JSON as is
> >> 2. Convert the JSON files to parquet files
> >>
> >> My JSON data is not in a fixed format, and file sizes vary from 10 KB
> >> to 1 MB.
> >>
> >> I would appreciate the community's advice on the above!
> >>
> >>
> >> Thanks,
> >> Divya
> >>
