drill-user mailing list archives

From Alexander Reshetov <alexander.v.reshe...@gmail.com>
Subject Re: Performance degradation for UNION ALL parquet data sources
Date Thu, 08 Dec 2016 21:11:04 GMT
On Thu, Dec 8, 2016 at 11:45 PM, Jinfeng Ni <jni@apache.org> wrote:
> Can you please check the Explain plan output for the original query
> and the query against view, and see if there is any difference in the
> two query plans? The difference might be caused by UNION ALL operator,
> which might lead to different parallelization mode.


Hi,

Here are the EXPLAIN PLAN outputs for both cases.

All data in one file:

0: jdbc:drill:zk=local> explain plan for select action['login'],
count(*) from dfs.datastore.events_parquest group by action['login'];
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          Project(EXPR$0=[$0], EXPR$1=[$1])
01-02            HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-03              Project(EXPR$0=[$0], EXPR$1=[$1])
01-04                HashToRandomExchange(dist0=[[$0]])
02-01                  UnorderedMuxExchange
03-01                    Project(EXPR$0=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                      HashAgg(group=[{0}], EXPR$1=[COUNT()])
03-03                        Project(EXPR$0=[ITEM($0, 'login')])
03-04                          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/mnt/data/events_parquest]], selectionRoot=file:/mnt/data/events_parquest, numFiles=1, usedMetadataFile=false, columns=[`action`.`login`]]])


Data split into 4 files and combined with UNION ALL:

0: jdbc:drill:zk=local> explain plan for select action['login'],
count(*) from dfs.datastore.parquet_synthetic_events_large_partition_all
group by action['login'];
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0], EXPR$1=[$1])
00-02        UnionExchange
01-01          Project(EXPR$0=[$0], EXPR$1=[$1])
01-02            HashAgg(group=[{0}], EXPR$1=[$SUM0($1)])
01-03              Project(EXPR$0=[$0], EXPR$1=[$1])
01-04                HashToRandomExchange(dist0=[[$0]])
02-01                  UnorderedMuxExchange
03-01                    Project(EXPR$0=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0)])
03-02                      HashAgg(group=[{0}], EXPR$1=[COUNT()])
03-03                        Project(EXPR$0=[ITEM($2, 'login')])
03-04                          UnionAll(all=[true])
03-06                            Project(timestamp=[$0], client_id=[$1], action=[$2])
03-08                              UnionAll(all=[true])
03-10                                Project(timestamp=[$0], client_id=[$1], action=[$2])
03-12                                  UnionAll(all=[true])
03-14                                    Project(timestamp=[$0], client_id=[$1], action=[$2])
03-16                                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/mnt/data/parquet_synthetic_events_large_partition_0]], selectionRoot=file:/mnt/data/parquet_synthetic_events_large_partition_0, numFiles=1, usedMetadataFile=false, columns=[`timestamp`, `client_id`, `action`]]])
03-13                                    Project(timestamp=[$0], client_id=[$1], action=[$2])
03-15                                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/mnt/data/parquet_synthetic_events_large_partition_1]], selectionRoot=file:/mnt/data/parquet_synthetic_events_large_partition_1, numFiles=1, usedMetadataFile=false, columns=[`timestamp`, `client_id`, `action`]]])
03-09                                Project(timestamp=[$0], client_id=[$1], action=[$2])
03-11                                  Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/mnt/data/parquet_synthetic_events_large_partition_2]], selectionRoot=file:/mnt/data/parquet_synthetic_events_large_partition_2, numFiles=1, usedMetadataFile=false, columns=[`timestamp`, `client_id`, `action`]]])
03-05                            Project(timestamp=[$0], client_id=[$1], action=[$2])
03-07                              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/mnt/data/parquet_synthetic_events_large_partition_3]], selectionRoot=file:/mnt/data/parquet_synthetic_events_large_partition_3, numFiles=1, usedMetadataFile=false, columns=[`timestamp`, `client_id`, `action`]]])
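
To make the comparison Jinfeng asked for easier to eyeball, here is a minimal Python sketch that counts operator names in a text plan and reports what one plan has in addition to another. It is not part of Drill: the helper name `plan_operators` is made up, and it assumes every operator is introduced by an `NN-NN` id followed by the operator name, as in the output above. The two sample strings are abbreviated stand-ins for illustration only, not the real plans.

```python
import re
from collections import Counter

# An operator id such as "03-04" followed by the operator name.
# \s+ also matches a newline, so ids whose operator was wrapped onto
# the next line by the mail client are still picked up.
OP_RE = re.compile(r"\b(\d{2}-\d{2})\s+([A-Za-z]+)")

def plan_operators(plan_text):
    """Count operator names in a Drill text plan (hypothetical helper)."""
    return Counter(name for _op_id, name in OP_RE.findall(plan_text))

# Abbreviated stand-ins for the two plans (operator names only).
single_file = "00-00 Screen\n00-01 Project\n03-02 HashAgg\n03-04 Scan"
union_all = (
    "00-00 Screen\n00-01 Project\n03-02 HashAgg\n03-04 UnionAll\n"
    "03-05 Project\n03-07 Scan\n03-06 Project\n03-08 Scan"
)

# Counter subtraction keeps only operators that occur more often
# in the UNION ALL plan than in the single-file plan.
diff = plan_operators(union_all) - plan_operators(single_file)
print(diff)
```

Operator counts are only part of the picture. In the two plans above the scanned columns also differ: the single-file scan reads only `action`.`login`, while each partition scan under UnionAll reads `timestamp`, `client_id` and `action`.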
