drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Kincaid <kincaid.d...@gmail.com>
Subject Re: Drill query planning taking a LONG time
Date Tue, 14 Feb 2017 13:02:43 GMT
Thank you for the feedback. It seems there is nothing more I can do on my
end. What are my next steps? Shall I create a defect in the Drill Jira?

- Dave

On Mon, Feb 13, 2017 at 5:13 PM, Jinfeng Ni <jni@apache.org> wrote:

> The size of parquet files will matter in terms of meta data access
> time, which is just 212 ms according to your log file. My
> understanding is it does not matter too much to the overall planning
> times. That's why it probably makes sense to try over such a small toy
> example.
>
> Normally the planning time for such simple query should be much
> shorter than 12 minutes.  It indicates it could be caused by a code
> bug, or something else that we are currently unaware of.
>
>
>
>
>
>
> On Mon, Feb 13, 2017 at 2:47 PM, David Kincaid <kincaid.dave@gmail.com>
> wrote:
> > The example in DRILL-5183 is just a very small toy example to demonstrate
> > the bug with how Drill reads Parquet array fields. It doesn't have
> anything
> > to do with this planning issue (at least I don't think it does). Sorry
> if I
> > confused things with that reference.
> >
> > I just tried running our query directly against the table at
> > dfs.`parquet/transaction` and get the same result (12 minutes of planning
> > time). I disabled the cp and s3 storage plugins that were enabled so that
> > only the dfs storage plugin is enabled and the result is the same.
> >
> > Is this expected for Drill to take this long in the planning phase for a
> > query? Is there anything else I can try or information I could provide to
> > help identify the bug (seems like a bug to me)? I really appreciate you
> > guys helping out so quickly this afternoon.
> >
> > - Dave
> >
> > On Mon, Feb 13, 2017 at 4:13 PM, Jinfeng Ni <jni@apache.org> wrote:
> >
> >> I downloaded books.parquet from DRILL-5183, and created a view on top
> >> of this single parquet file. Then, run EXPLAIN for the query, and it
> >> completes within 1.2 seconds on Drill 1.8.0 release. (The # of parquet
> >> files would impact the time to fetch metadata. Since it's not the
> >> bottleneck in this case, it should not cause a big difference).
> >>
> >> Do you see the long planning time issue for this query only, or it
> >> happens for other queries as well? Besides the possibility of planning
> >> rule bugs, we once saw another possible cause of long planning issue.
> >> In your storage plugin configuration, if you enable some other storage
> >> plugin (for instance, hbase, or hive etc) which are slow to access,
> >> then those un-relevant storage plugin might impact your query as well.
> >> You may temporarily disable those storage plugins, and see if it's the
> >> cause of the problem.
> >>
> >> 0: jdbc:drill:zk=local> explain plan for
> >> . . . . . . . . . . . > select fltb1.sapId, yearmo,
> >> . . . . . . . . . . . > COUNT(*) as totalcnt,
> >> . . . . . . . . . . . > count(distinct(CASE
> >> . . . . . . . . . . . >                WHEN
> >> . . . . . . . . . . . >                (REPEATED_CONTAINS(fltb1.
> >> classLabels,
> >> . . . . . . . . . . . >
> >> 'Thing:Service:MedicalService:Diagnostic:Radiology:Ultrasound.*'))
> >> . . . . . . . . . . . >                THEN fltb1.invoiceId
> >> . . . . . . . . . . . >                END)) as ultracount,
> >> . . . . . . . . . . . > count(distinct (CASE
> >> . . . . . . . . . . . >                 WHEN
> >> . . . . . . . . . . . >                 (REPEATED_CONTAINS(fltb1.
> >> classLabels,
> >> . . . . . . . . . . . >
> >> 'Thing:Service:MedicalService:Diagnostic:LaboratoryTest.*'))
> >> . . . . . . . . . . . >                 THEN fltb1.invoiceId
> >> . . . . . . . . . . . >                 END)) as labcount
> >> . . . . . . . . . . . > from (
> >> . . . . . . . . . . . >   select sapid, invoiceId,
> >> . . . . . . . . . . . >         TO_CHAR(TO_TIMESTAMP(transactionDate,
> >> 'YYYY-MM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM') yearmo,
> >> . . . . . . . . . . . >         classLabels
> >> . . . . . . . . . . . >       from dfs.tmp.transactionView) fltb1
> >> . . . . . . . . . . . > group by fltb1.sapId, yearmo;
> >> +------+------+
> >> | text | json |
> >> +------+------+
> >> | 00-00    Screen
> >> 00-01      Project(sapId=[$0], yearmo=[$1], totalcnt=[$2],
> >> ultracount=[$3], labcount=[$4])
> >> ....................................
> >> 00-09                SelectionVectorRemover
> >> 00-12                  Sort(sort0=[$0], sort1=[$1], dir0=[ASC],
> dir1=[ASC])
> >> 00-15                    HashAgg(group=[{0, 1}], totalcnt=[COUNT()])
> >> ................................
> >> 00-22                          Scan(groupscan=[ParquetGroupScan
> >> [entries=[ReadEntryWithPath [path=file:/tmp/parquet/transaction]],
> >> selectionRoot=file:/tmp/parquet/transaction, numFiles=1,
> >> usedMetadataFile=false, columns=[`sapId`, `invoiceId`,
> >> `transactionDate`, `classLabels`.`array`]]])
> >>
> >> 1 row selected (1.195 seconds)
> >>
> >>
> >> On Mon, Feb 13, 2017 at 1:51 PM, David Kincaid <kincaid.dave@gmail.com>
> >> wrote:
> >> > Here is the entire transactionView.view.drill file. As you can see the
> >> view
> >> > itself is very simple and is just wrapping a syntactic problem with
> the
> >> > array field. That's an issue I reported in Jira under DRILL-5183 (
> >> > https://issues.apache.org/jira/browse/DRILL-5183)
> >> >
> >> > {
> >> >   "name" : "transactionView",
> >> >   "sql" : "SELECT `transactionRowKey`, `sapId`, `practiceName`,
> >> > `practiceCity`, `practiceState`, `practicePostalCode`, `animalId`,
> >> > `dateOfBirth`, `species`, `breed`, `gender`, `status`, `ownerId`,
> >> > `itemType`, `classification`, `subclass`, `practiceDescription`,
> >> > `clientDescription`, `invoiceId`, `unitOfMeasure`, `vendorName`,
> >> `vaccine`,
> >> > `rabies`, `vaccineType`, `price`, `quantity`, `transactionDate`,
> >> > `visitReason`, `speciesCode`, `genderCode`,
> `t`.`classLabels`['array'] AS
> >> > `classLabels`\nFROM `dfs`.`/parquet/transaction` AS `t`",
> >> >   "fields" : [ {
> >> >     "name" : "transactionRowKey",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "sapId",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "practiceName",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "practiceCity",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "practiceState",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "practicePostalCode",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "animalId",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "dateOfBirth",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "species",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "breed",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "gender",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "status",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "ownerId",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "itemType",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "classification",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "subclass",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "practiceDescription",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "clientDescription",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "invoiceId",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "unitOfMeasure",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "vendorName",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "vaccine",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "rabies",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "vaccineType",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "price",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "quantity",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "transactionDate",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "visitReason",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "speciesCode",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "genderCode",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   }, {
> >> >     "name" : "classLabels",
> >> >     "type" : "ANY",
> >> >     "isNullable" : true
> >> >   } ],
> >> >   "workspaceSchemaPath" : [ ]
> >> > }
> >> >
> >> > On Mon, Feb 13, 2017 at 3:47 PM, Jinfeng Ni <jni@apache.org> wrote:
> >> >
> >> >> Yes, the log confirmed that the planning, especially physical
> >> >> planning, is the one that took most of the time.
> >> >>
> >> >> If the definition of view s3.cisexport.transactionView is not very
> >> >> complicated (involves large # of tables), then it's possible that
> some
> >> >> planner rules have a bug. (In the past, we once saw couple of planner
> >> >> rules would be fired in a loop).
> >> >>
> >> >> Is it possible that you can share the DDL of the view?  That may help
> >> >> us re-produce the problem and take a look at the trace of Calcite,
> >> >> which Drill uses as the query planner.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message