drill-user mailing list archives

From Vitalii Diravka <vita...@apache.org>
Subject Re: Drill JDBC Plugin limit queries
Date Sat, 20 Oct 2018 17:12:06 GMT
Rahul,

*double rows* is an estimated row count, which can be used in choosing the
right Join operator, and maybe elsewhere.
But to have a proper *PushLimitIntoScan* it is necessary to change the
*String sql*.
Possibly it is necessary to keep the *JdbcImplementor* or *JdbcImplementor.Result*
from *JdbcPrel* in the *JdbcGroupScan* class
and change the sqlNode in the *applyLimit()* method.
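As a rough illustration of the idea above (plain Java, hypothetical names, and
not the actual Drill/Calcite classes: real code would rewrite the SqlNode kept
from *JdbcImplementor.Result* rather than concatenating strings), an
*applyLimit()*-style hook might look like:

```java
// Simplified stand-in for the JdbcGroupScan change discussed above.
// Hypothetical class; real Drill code would rewrite the Calcite SqlNode
// kept from JdbcImplementor.Result instead of concatenating SQL strings.
public class JdbcLimitSketch {
    private final String sql;   // SQL generated for the external DB
    private final double rows;  // estimated row count used by the planner

    public JdbcLimitSketch(String sql, double rows) {
        this.sql = sql;
        this.rows = rows;
    }

    // Mirrors the applyLimit() idea: return a new scan whose SQL carries
    // the limit, and cap the row estimate with the limit value.
    public JdbcLimitSketch applyLimit(int maxRecords) {
        return new JdbcLimitSketch(sql + " LIMIT " + maxRecords,
                                   Math.min(rows, maxRecords));
    }

    public String getSql() { return sql; }
    public double getRows() { return rows; }

    public static void main(String[] args) {
        JdbcLimitSketch scan =
            new JdbcLimitSketch("SELECT * FROM \"public\".\"actor\"", 100.0);
        JdbcLimitSketch limited = scan.applyLimit(5);
        System.out.println(limited.getSql());
        System.out.println(limited.getRows());
    }
}
```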

Not sure why *DrillPushLimitToScanRule* is not matched. Is it added to the
planner program?
To find the reason, you can compare the flow with a Parquet Scan, for
instance.


On Fri, Oct 19, 2018 at 7:37 PM Rahul Raj <rahul.raj@option3.io> wrote:

> Vitalii,
>
> I made both changes, but it did not work: a full scan was issued, as
> shown in the plan below.
>
> 00-00    Screen : rowType = RecordType(INTEGER actor_id, VARCHAR(45)
> first_name, VARCHAR(45) last_name, TIMESTAMP(3) last_update): rowcount
> = 5.0, cumulative cost = {120.5 rows, 165.5 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 227
> 00-01      Project(actor_id=[$0], first_name=[$1], last_name=[$2],
> last_update=[$3]) : rowType = RecordType(INTEGER actor_id, VARCHAR(45)
> first_name, VARCHAR(45) last_name, TIMESTAMP(3) last_update): rowcount
> = 5.0, cumulative cost = {120.0 rows, 165.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 226
> 00-02        SelectionVectorRemover : rowType = RecordType(INTEGER
> actor_id, VARCHAR(45) first_name, VARCHAR(45) last_name, TIMESTAMP(3)
> last_update): rowcount = 5.0, cumulative cost = {115.0 rows, 145.0
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 225
> 00-03          Limit(fetch=[5]) : rowType = RecordType(INTEGER
> actor_id, VARCHAR(45) first_name, VARCHAR(45) last_name, TIMESTAMP(3)
> last_update): rowcount = 5.0, cumulative cost = {110.0 rows, 140.0
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 224
> 00-04            Limit(fetch=[5]) : rowType = RecordType(INTEGER
> actor_id, VARCHAR(45) first_name, VARCHAR(45) last_name, TIMESTAMP(3)
> last_update): rowcount = 5.0, cumulative cost = {105.0 rows, 120.0
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 223
> 00-05              Jdbc(sql=[SELECT * FROM "public"."actor" ]) :
> rowType = RecordType(INTEGER actor_id, VARCHAR(45) first_name,
> VARCHAR(45) last_name, TIMESTAMP(3) last_update): rowcount = 100.0,
> cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 164
>
> Regards,
> Rahul
>
> On Fri, Oct 19, 2018 at 8:47 PM Rahul Raj <rahul.raj@option3.io> wrote:
>
> > I will make the changes and update you.
> >
> > Regards,
> > Rahul
> >
> > On Fri, Oct 19, 2018 at 1:05 AM Vitalii Diravka <vitalii@apache.org>
> > wrote:
> >
> >> Rahul,
> >>
> >> Possibly *JdbcGroupScan* can be improved, for instance by overriding the
> >> *supportsLimitPushdown()* and *applyLimit()* methods;
> >> the *double rows* field can be updated with the limit value.
> >>
> >> I've performed the following query: select * from mysql.`testdb`.`table`
> >> limit 2;
> >> but the following one is passed to MySQL: SELECT * FROM `testdb`.`table`
> >>
> >>
> https://github.com/apache/drill/blob/master/contrib/storage-jdbc/src/main/java/org/apache/drill/exec/store/jdbc/JdbcRecordReader.java#L187
> >> So it definitely should be improved.
> >>
> >> *Note:* Changed mailing list to devs.
> >>
> >> On Sun, Oct 14, 2018 at 6:30 AM Rahul Raj <rahul.raj@option3.io> wrote:
> >>
> >> > Vitalii,
> >> >
> >> > Created documentation ticket DRILL-6794
> >> >
> >> > How do we proceed on extending the scan operators to support JDBC
> >> plugins?
> >> >
> >> > Regards,
> >> > Rahul
> >> >
> >> > On Sat, Oct 13, 2018 at 6:47 PM Vitalii Diravka <vitalii@apache.org>
> >> > wrote:
> >> >
> >> > > To update the documentation, since those issues were solved by using
> >> > > these properties in the connection URL:
> >> > > defaultRowFetchSize=10000  [1]
> >> > > defaultAutoCommit=false    [2]
> >> > > The full URL there was: "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000"
> >> > >
> >> > > If some issues are still present, it is also reasonable to create
> >> > > tickets to track them.
> >> > >
> >> > > [1]
> >> > > https://mail-archives.apache.org/mod_mbox/drill-user/201808.mbox/%3CCADN0Fn9066hwvu_ZyDJ24tkAoJH5hqXoysCv83z7DdSSfjr-CQ%40mail.gmail.com%3E
> >> > > [2]
> >> > > https://mail-archives.apache.org/mod_mbox/drill-user/201808.mbox/%3C0d36e0e6e8dc1e77bbb67bbfde5f5296e290c075.camel%40omnicell.com%3E
> >> > >
> >> > > On Sat, Oct 13, 2018 at 3:56 PM Rahul Raj <rahul.raj@option3.io>
> >> wrote:
> >> > >
> >> > > > Should I create tickets to track these issues or should I create a
> >> > > > ticket to update the documentation?
> >> > > >
> >> > > > Rahul
> >> > > >
> >> > > > On Sat, Oct 13, 2018 at 6:16 PM Vitalii Diravka <vitalii@apache.org>
> >> > > > wrote:
> >> > > >
> >> > > > > 1. You are right; it means it is reasonable to extend this rule to
> >> > > > > apply to other Scan operators (or possibly to create a separate one).
> >> > > > > 2. There was a question about OOM issues in Drill + PostgreSQL, please
> >> > > > > take a look [1].
> >> > > > >     Since you are trying to set up these configs, it would be good if
> >> > > > > you create a Jira ticket to add this info to the Drill docs [2].
> >> > > > >
> >> > > > > [1]
> >> > > > > https://mail-archives.apache.org/mod_mbox/drill-user/201808.mbox/browser
> >> > > > > [2] https://drill.apache.org/docs/rdbms-storage-plugin/
> >> > > > >
> >> > > > > On Sat, Oct 13, 2018 at 2:21 PM Rahul Raj <rahul.raj@option3.io>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Regarding the heap out-of-memory error, it could be that the JDBC
> >> > > > > > driver is prefetching the entire record set into memory. I just had a
> >> > > > > > look at JdbcRecordReader; it looks like by setting
> >> > > > > > connection#autoCommit(false) and a sufficient fetch size we could
> >> > > > > > force the driver to stream data as required. This is how the
> >> > > > > > postgres driver works:
> >> > > > > >
> >> > > > > > https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
> >> > > > > >
> >> > > > > > We will have to see the behaviour of other drivers too.
> >> > > > > >
> >> > > > > > Let me know your thoughts here.
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Rahul
> >> > > > > >
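The cursor-based fetching described in the quoted message can be sketched as
follows (connection details and table names are placeholders; per the
PostgreSQL driver's documented behaviour, it only streams when autocommit is
off and a fetch size is set):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingFetchSketch {
    // The PostgreSQL driver uses a server-side cursor only when autocommit
    // is disabled and a positive fetch size is set; otherwise the whole
    // result set is buffered on the Java heap.
    static boolean cursorModeEnabled(boolean autoCommit, int fetchSize) {
        return !autoCommit && fetchSize > 0;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);          // required for cursor mode
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(10000);         // rows fetched per round trip
                try (ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {
                    while (rs.next()) {
                        // process one batch of rows at a time instead of
                        // materializing the entire table in memory
                    }
                }
            }
        }
    }
}
```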
> >> > > > > >
> >> > > > > > On Sat, Oct 13, 2018 at 3:47 PM Rahul Raj <rahul.raj@option3.io>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hi Vitalii,
> >> > > > > > >
> >> > > > > > > There are two concrete implementations of the class:
> >> > > > > > > DrillPushLimitToScanRule LIMIT_ON_SCAN and
> >> > > > > > > DrillPushLimitToScanRule LIMIT_ON_PROJECT.
> >> > > > > > > LIMIT_ON_SCAN has a comment mentioning "For now only applies to
> >> > > > > > > Parquet. And pushdown only apply limit but not offset".
> >> > > > > > >
> >> > > > > > > Also I enabled debug mode and found the LIMIT is not getting
> >> > > > > > > pushed into the query:
> >> > > > > > >     LimitPrel(fetch=[11]): rowcount = 11.0, cumulative cost = {83.0 rows, 226.0 cpu, 0.0 io, 585728.0 network, 0.0 memory}, id = 261
> >> > > > > > >       UnionExchangePrel: rowcount = 11.0, cumulative cost = {72.0 rows, 182.0 cpu, 0.0 io, 585728.0 network, 0.0 memory}, id = 259
> >> > > > > > >         LimitPrel(fetch=[11]): rowcount = 11.0, cumulative cost = {61.0 rows, 94.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 257
> >> > > > > > >           JdbcPrel(sql=[SELECT * FROM "u_g001"."executioniteration" WHERE "id" > 360500000 ]): rowcount = 50.0, cumulative cost = {50.0 rows, 50.0 cpu
> >> > > > > > >
> >> > > > > > > Regarding the second point, it's the Java heap getting filled
> >> > > > > > > with JDBC results. How do we address this?
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > Rahul
> >> > > > > > >
> >> > > > > > > On Fri, Oct 12, 2018 at 8:11 PM Vitalii Diravka <vitalii@apache.org>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > >> Hi Rahul,
> >> > > > > > >>
> >> > > > > > >> Drill has a *DrillPushLimitToScanRule* [1] rule, which should do
> >> > > > > > >> this optimization, provided the GroupScan supports Limit Push Down.
> >> > > > > > >> Also you can verify in debug mode whether this rule is fired.
> >> > > > > > >> Possibly for some external DB (like MapR-DB) Drill should have a
> >> > > > > > >> separate class for this optimization [2].
> >> > > > > > >>
> >> > > > > > >> [1]
> >> > > > > > >> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java#L28
> >> > > > > > >> [2]
> >> > > > > > >> https://github.com/apache/drill/pull/1466/files#diff-4819b70118487d81bc9c46a04b0eaaa3R37
> >> > > > > > >>
> >> > > > > > >> On Fri, Oct 12, 2018 at 3:19 PM Rahul Raj <rahul.raj@option3.io>
> >> > > > > > >> wrote:
> >> > > > > > >>
> >> > > > > > >> > Hi,
> >> > > > > > >> >
> >> > > > > > >> > Drill does not push LIMIT queries down to external databases,
> >> > > > > > >> > and I assume it could be more related to Calcite. This leads
> >> > > > > > >> > to out-of-memory situations while querying a large table to
> >> > > > > > >> > view a few records. Is there something that could be improved
> >> > > > > > >> > here? One solution would be to push filters down to the DB,
> >> > > > > > >> > and/or combine this with some JDBC batch size limit to flush
> >> > > > > > >> > parts as Parquet.
> >> > > > > > >> >
> >> > > > > > >> > Regards,
> >> > > > > > >> > Rahul
> >> > > > > > >> >
> >> > > > > > >> > --
> >> > > > > > >> > _*** This email and any files transmitted with it are
> >> > > > > > >> > confidential and intended solely for the use of the individual
> >> > > > > > >> > or entity to whom it is addressed. If you are not the named
> >> > > > > > >> > addressee then you should not disseminate, distribute or copy
> >> > > > > > >> > this e-mail. Please notify the sender immediately and delete
> >> > > > > > >> > this e-mail from your system.***_
> >> > > > > > >> >
> >> > > > > > >>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
