drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Paris <nipari...@gmail.com>
Subject Re: question about views
Date Mon, 30 Apr 2018 13:30:56 GMT
Hi

This looks an interesting design.

Am I correct such view
would hit the RDBMS for every query ?
However such view would hit the parquet file only when
the timestamp predicate would match a partition ?

Any news on a recent test to confirm the design ?

Thanks

2018-03-20 6:49 GMT+01:00 Ted Dunning <ted.dunning@gmail.com>:

> Aman,
>
> That is exactly the clarification that I needed. I had a hazy memory of a
> problem in this area, but not enough to actually figure out the current
> state.
>
> In case anybody cares, being able to do this is really handy. The basic
> idea is to keep long history in files and recent history in a DB. That
> allows you to create files with data that is advantageously sorted in order
> to get excellent compression. You can get nearly atomic switch-over to
> newly created files with lazy deletion of database entries by using a
> reference to a cutoff date in a database row. The file side would only look
> for data before the cutoff and the DB would only look for data after the
> cut. By positioning new files (created by CTAS on an about to be obsolete
> part of the DB) before changing the cutoff date, we get apparent atomicity.
>
> After the switch, and after a reasonable delay beyond that (to let all
> pending queries finish), the DB can be trimmed.
>
> Without a working pushdown through unions, this is all kind of pointless.
> If that is working now, it would be fabulous.
>
> An example of how big a win this can be, consider a use case where we want
> to keep all old states of customer preferences and context (say for a
> mobile phone). Almost all of the hundreds of settings for an individual
> would be unchanged even if a few do change. That means that if you could
> arrange a day (or more) of data by user id, the columnar compression of
> parquet would crush the data size. This only works, however, if you can
> collect a fair number of rows for each user. Thus the idea of a hybrid
> setup.
>
>
>
> On Mon, Mar 19, 2018 at 11:57 PM, Aman Sinha <amansinha@apache.org> wrote:
>
> > Due to an infinite loop occurring in Calcite planning, we had to disable
> > the filter pushdown past the union (SetOps).  See
> > https://issues.apache.org/jira/browse/DRILL-3855.
> > Now that we have rebased on Calcite 1.15.0, we should re-enable this and
> > test and if the pushdown works then the partition pruning on both sides
> of
> > the union should automatically work after that.
> >
> > Will follow-up on this..
> >
> > -Aman
> >
> > On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua <kunalkhatua@gmail.com>
> > wrote:
> >
> > > I think Ted's question is 2 fold, with the former being more important.
> > > 1. Can we push filters past a union.
> > > 2. Will Drill push filters down to the source.
> > >
> > > For the latter, it depends on the source.
> > > For the former, it depends primarily on whether Calcite supports this.
> I
> > > haven't tried it, so I can't say.
> > >
> > > On 3/19/2018 2:22:54 PM, rahul challapalli <challapallirahul@gmail.com
> >
> > > wrote:
> > > First I would suggest to ignore the view and try out a query which has
> > the
> > > required filters as part of the subqueries on both sides of the union
> > (for
> > > both the database and partitioned parquet data). The plan for such a
> > query
> > > should have the answers to your question. If both the subqueries
> > > independently prune out un-necessary data, using partitions or
> indexes, I
> > > don't think adding a union between them would alter that behavior.
> > >
> > > -Rahul
> > >
> > > On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
> > >
> > > > IF I create a view that is a union of partitioned parquet files and a
> > > > database that has secondary indexes, will Drill be able to properly
> > push
> > > > down query limits into both parts of the union?
> > > >
> > > > In particular, if I have lots of archival data and parquet
> partitioned
> > by
> > > > time but my query only asks for recent data that is in the database,
> > will
> > > > the query avoid the parquet files entirely (as you would wish)?
> > > >
> > > > Conversely, if the data I am asking for is entirely in the archive,
> > will
> > > > the query make use of the partitioning on my parquet files correctly?
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message