spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Сергей Романов <romano...@inbox.ru.INVALID>
Subject Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.
Date Fri, 02 Sep 2016 11:03:25 GMT

Hi, Mich,

Column x29 does not seems to be any special. It's a newly created table and I did not calculate
stats for any columns. Actually, I can sum a single column several times in query and face
some landshift performance hit at some "magic" point. Setting "set spark.sql.codegen.wholeStage=false"
makes all requests run in a similar slow time (which is slower original time).
PS. Does Stats even helps for Spark queries?
SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,  SUM(`dd_convs`)
AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
LIMIT 30;
30 rows selected (1.37 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.399 seconds)

SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,  SUM(`actual_dsp_fee`)
AS `actual_dsp_fee`  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP
BY `advertiser_id`  LIMIT 30; 
30 rows selected (1.379 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.377 seconds)
SELECT field, SUM(x28) x repeat 40 times  FROM parquet_table WHERE partition = 1 GROUP BY
field -> 1.774s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) 
FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT
30;
30 rows selected (1.774 seconds)
30 rows selected (1.721 seconds)
30 rows selected (2.813 seconds)
SELECT field, SUM(x28) x repeat 41 times FROM parquet_table WHERE partition = 1 GROUP BY field
-> 7.314s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`, SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`)  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
LIMIT 30;
30 rows selected (7.314 seconds)
30 rows selected (7.27 seconds)
30 rows selected (7.279 seconds)
SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> 1.378s
0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM `slicer`.`573_slicer_rnd_13` 
WHERE dt = '2016-07-28';
plan  == Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L),
sum(dd_convs#159025L), ... 33 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L), partial_sum(dd_convs#159025L),
partial_sum(dd_convs#159025L), ... 33 more fields])
      +- *Project [dd_convs#159025L]
         +- *BatchedScan parquet slicer.573_slicer_rnd_13[dd_convs#159025L,dt#159005,etl_path#159006]
Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer...,
PushedFilters: [], ReadSchema: struct<dd_convs:bigint>

1 row selected (1.378 seconds)
1 row selected (1.849 seconds)
1 row selected (2.641 seconds)
SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1 -> 5.733s
0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)
AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28';
plan  == Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L), sum(dd_convs#158694L),
sum(dd_convs#158694L), ... 34 more fields])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L), partial_sum(dd_convs#158694L),
partial_sum(dd_convs#158694L), ... 34 more fields])
      +- *Project [dd_convs#158694L]
         +- *BatchedScan parquet slicer.573_slicer_rnd_13[dd_convs#158694L,dt#158674,etl_path#158675]
Format: ParquetFormat, InputPaths: hdfs://spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer...,
PushedFilters: [], ReadSchema: struct<dd_convs:bigint>

1 row selected (5.733 seconds)
1 row selected (5.693 seconds)
1 row selected (4.67 seconds)

SET spark.sql.codegen.wholeStage=false;
Without CodeGen: SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1
-> 13.712s
Without CodeGen: SELECT SUM(x28) x repeat 58 times FROM parquet_table WHERE partition = 1
-> 13.405s


>Четверг,  1 сентября 2016, 19:35 +03:00 от Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>
>What happens if you run the following query on its own. How long it takes? 
>
>SELECT field, SUM(x29) FROM FROM parquet_table WHERE partition = 1 GROUP BY field
>
>Have Stats been updated for all columns in Hive? And the type x29 field?
>
>HTH
>
>
>Dr Mich Talebzadeh
> 
>LinkedIn    https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
>http://talebzadehmich.wordpress.com
>
>Disclaimer:  Use it at your own risk. Any and all responsibility for any loss, damage
or destruction
of data or any other property which may arise from relying on this email's technical content
is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction. 
> 
>On 1 September 2016 at 16:55, Сергей Романов  < romanovsa@inbox.ru.invalid
> wrote:
>>Hi, 
>>
>>When I run a query like "SELECT field, SUM(x1), SUM(x2)... SUM(x28) FROM parquet_table
WHERE partition = 1 GROUP BY field" it runs in under 2 seconds, but when I add just one more
aggregate field to the query "SELECT field, SUM(x1), SUM(x2)... SUM(x28), SUM(x29) FROM parquet_table
WHERE partition = 1 GROUP BY field" it runs in about 12 seconds. 
>>
>>Why does it happens? Can I make second query run as fast as first one? I tried browsing
logs in TRACE mode and comparing CODEGEN but everything looks pretty much the same excluding
execution time.
>>
>>Can this be related to SPARK-17115 ?
>>
>>I'm using Spark 2.0 Thrift Server over YARN/HDFS with partitioned parquet hive tables.

>>
>>Complete example using beeline: 
>>
>>0: jdbc:hive2://spark-master1.uslicer> DESCRIBE EXTENDED `slicer`.`573_slicer_rnd_13`;

>>col_name,data_type,comment 
>>actual_dsp_fee,float,NULL 
>>actual_pgm_fee,float,NULL 
>>actual_ssp_fee,float,NULL 
>>advertiser_id,int,NULL 
>>advertiser_spent,double,NULL 
>>anomaly_clicks,bigint,NULL 
>>anomaly_conversions_filtered,bigint,NULL 
>>anomaly_conversions_unfiltered,bigint,NULL 
>>anomaly_decisions,float,NULL 
>>bid_price,float,NULL 
>>campaign_id,int,NULL 
>>click_prob,float,NULL 
>>clicks,bigint,NULL 
>>clicks_static,bigint,NULL 
>>conv_prob,float,NULL 
>>conversion_id,bigint,NULL 
>>conversions,bigint,NULL 
>>creative_id,int,NULL 
>>dd_convs,bigint,NULL 
>>decisions,float,NULL 
>>dmp_liveramp_margin,float,NULL 
>>dmp_liveramp_payout,float,NULL 
>>dmp_nielsen_margin,float,NULL 
>>dmp_nielsen_payout,float,NULL 
>>dmp_rapleaf_margin,float,NULL 
>>dmp_rapleaf_payout,float,NULL 
>>e,float,NULL 
>>expected_cpa,float,NULL 
>>expected_cpc,float,NULL 
>>expected_payout,float,NULL 
>>first_impressions,bigint,NULL 
>>fraud_clicks,bigint,NULL 
>>fraud_impressions,bigint,NULL 
>>g,float,NULL 
>>impressions,float,NULL 
>>line_item_id,int,NULL 
>>mail_type,string,NULL 
>>noads,float,NULL 
>>predict_version,bigint,NULL 
>>publisher_id,int,NULL 
>>publisher_revenue,double,NULL 
>>pvc,bigint,NULL 
>>second_price,float,NULL 
>>thirdparty_margin,float,NULL 
>>thirdparty_payout,float,NULL 
>>dt,string,NULL 
>>etl_path,string,NULL 
>># Partition Information,, 
>># col_name,data_type,comment 
>>dt,string,NULL 
>>etl_path,string,NULL 
>>
>>
>>data_type  CatalogTable( 
>>        Table: `slicer`.`573_slicer_rnd_13` 
>>        Owner: spark 
>>        Created: Fri Aug 12 12:30:20 UTC 2016 
>>        Last Access: Thu Jan 01 00:00:00 UTC 1970 
>>        Type: MANAGED 
>>        Schema: [`actual_dsp_fee` float, `actual_pgm_fee` float, `actual_ssp_fee`
float, `advertiser_id` int, `advertiser_spent` double, `anomaly_clicks` bigint, `anomaly_conversions_filtered`
bigint, `anomaly_conversions_unfiltered` bigint, `anomaly_decisions` float, `bid_price` float,
`campaign_id` int, `click_prob` float, `clicks` bigint, `clicks_static` bigint, `conv_prob`
float, `conversion_id` bigint, `conversions` bigint, `creative_id` int, `dd_convs` bigint,
`decisions` float, `dmp_liveramp_margin` float, `dmp_liveramp_payout` float, `dmp_nielsen_margin`
float, `dmp_nielsen_payout` float, `dmp_rapleaf_margin` float, `dmp_rapleaf_payout` float,
`e` float, `expected_cpa` float, `expected_cpc` float, `expected_payout` float, `first_impressions`
bigint, `fraud_clicks` bigint, `fraud_impressions` bigint, `g` float, `impressions` float,
`line_item_id` int, `mail_type` string, `noads` float, `predict_version` bigint, `publisher_id`
int, `publisher_revenue` double, `pvc` bigint, `second_price` float, `thirdparty_margin` float,
`thirdparty_payout` float, `dt` string, `etl_path` string] 
>>        Partition Columns: [`dt`, `etl_path`] 
>>        Properties: [transient_lastDdlTime=1471005020] 
>>        Storage(Location: hdfs:// spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer_rnd_13
, InputFormat:  org.apache.hadoop.hive.ql.io .parquet.MapredParquetInputFormat, OutputFormat:
 org.apache.hadoop.hive.ql.io .parquet.MapredParquetOutputFormat, Serde:  org.apache.hadoop.hive.ql.io
.parquet.serde.ParquetHiveSerDe, Properties: [serialization.format=1])) 
>>comment     
>>
>>
>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`,
SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`)
AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`,
SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`)
AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS
`fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`)
AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`)
AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`,
SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`)
AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS
`e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`)
AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`)
AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee`
FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT
30; 
>>plan  == Physical Plan == 
>>CollectLimit 30 
>>+- *HashAggregate(keys=[advertiser_id#13866], functions=[sum(conversions#13879L),
sum(cast(dmp_rapleaf_margin#13887 as double)), sum(pvc#13904L), sum(cast(dmp_nielsen_payout#13886
as double)), sum(fraud_clicks#13894L), sum(cast(impressions#13897 as double)), sum(cast(conv_prob#13877
as double)), sum(cast(dmp_liveramp_payout#13884 as double)), sum(cast(decisions#13882 as double)),
sum(fraud_impressions#13895L), sum(advertiser_spent#13867), sum(cast(actual_ssp_fee#13865
as double)), sum(cast(dmp_nielsen_margin#13885 as double)), sum(first_impressions#13893L),
sum(clicks#13875L), sum(cast(second_price#13905 as double)), sum(cast(click_prob#13874 as
double)), sum(clicks_static#13876L), sum(cast(expected_payout#13892 as double)), sum(cast(bid_price#13872
as double)), sum(cast(noads#13900 as double)), sum(cast(e#13889 as double)), sum(cast(g#13896
as double)), sum(publisher_revenue#13903), ... 5 more fields]) 
>>   +- Exchange hashpartitioning(advertiser_id#13866, 3) 
>>      +- *HashAggregate(keys=[advertiser_id#13866], functions=[partial_sum(conversions#13879L),
partial_sum(cast(dmp_rapleaf_margin#13887 as double)), partial_sum(pvc#13904L), partial_sum(cast(dmp_nielsen_payout#13886
as double)), partial_sum(fraud_clicks#13894L), partial_sum(cast(impressions#13897 as double)),
partial_sum(cast(conv_prob#13877 as double)), partial_sum(cast(dmp_liveramp_payout#13884 as
double)), partial_sum(cast(decisions#13882 as double)), partial_sum(fraud_impressions#13895L),
partial_sum(advertiser_spent#13867), partial_sum(cast(actual_ssp_fee#13865 as double)), partial_sum(cast(dmp_nielsen_margin#13885
as double)), partial_sum(first_impressions#13893L), partial_sum(clicks#13875L), partial_sum(cast(second_price#13905
as double)), partial_sum(cast(click_prob#13874 as double)), partial_sum(clicks_static#13876L),
partial_sum(cast(expected_payout#13892 as double)), partial_sum(cast(bid_price#13872 as double)),
partial_sum(cast(noads#13900 as double)), partial_sum(cast(e#13889 as double)), partial_sum(cast(g#13896
as double)), partial_sum(publisher_revenue#13903), ... 5 more fields]) 
>>         +- *Project [actual_dsp_fee#13863, actual_pgm_fee#13864, actual_ssp_fee#13865,
advertiser_id#13866, advertiser_spent#13867, bid_price#13872, click_prob#13874, clicks#13875L,
clicks_static#13876L, conv_prob#13877, conversions#13879L, dd_convs#13881L, decisions#13882,
dmp_liveramp_margin#13883, dmp_liveramp_payout#13884, dmp_nielsen_margin#13885, dmp_nielsen_payout#13886,
dmp_rapleaf_margin#13887, dmp_rapleaf_payout#13888, e#13889, expected_payout#13892, first_impressions#13893L,
fraud_clicks#13894L, fraud_impressions#13895L, ... 6 more fields] 
>>            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_dsp_fee#13863,actual_pgm_fee#13864,actual_ssp_fee#13865,advertiser_id#13866,advertiser_spent#13867,bid_price#13872,click_prob#13874,clicks#13875L,clicks_static#13876L,conv_prob#13877,conversions#13879L,dd_convs#13881L,decisions#13882,dmp_liveramp_margin#13883,dmp_liveramp_payout#13884,dmp_nielsen_margin#13885,dmp_nielsen_payout#13886,dmp_rapleaf_margin#13887,dmp_rapleaf_payout#13888,e#13889,expected_payout#13892,first_impressions#13893L,fraud_clicks#13894L,fraud_impressions#13895L,...
8 more fields] Format: ParquetFormat, InputPaths: hdfs:// spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..
., PushedFilters: [], ReadSchema: struct<actual_dsp_fee:float,actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertise...

>>
>>
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,
SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`)
AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`,
SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`)
AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS
`fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`)
AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`)
AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`,
SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`)
AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS
`e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`)
AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`)
AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`, SUM(`actual_dsp_fee`) AS `actual_dsp_fee`
FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT
30; 
>>
>>(results for three runs) 
>>30 rows selected (11.904 seconds) 
>>30 rows selected (11.703 seconds) 
>>30 rows selected (11.52 seconds) 
>>
>>XXX 
>>
>>0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT `advertiser_id` AS `advertiser_id`,
SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`)
AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`,
SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`)
AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS
`fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`)
AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`)
AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`,
SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`)
AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS
`e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`)
AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`)
AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`
 WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>plan  == Physical Plan == 
>>CollectLimit 30 
>>+- *HashAggregate(keys=[advertiser_id#15269], functions=[sum(conversions#15282L),
sum(cast(dmp_rapleaf_margin#15290 as double)), sum(pvc#15307L), sum(cast(dmp_nielsen_payout#15289
as double)), sum(fraud_clicks#15297L), sum(cast(impressions#15300 as double)), sum(cast(conv_prob#15280
as double)), sum(cast(dmp_liveramp_payout#15287 as double)), sum(cast(decisions#15285 as double)),
sum(fraud_impressions#15298L), sum(advertiser_spent#15270), sum(cast(actual_ssp_fee#15268
as double)), sum(cast(dmp_nielsen_margin#15288 as double)), sum(first_impressions#15296L),
sum(clicks#15278L), sum(cast(second_price#15308 as double)), sum(cast(click_prob#15277 as
double)), sum(clicks_static#15279L), sum(cast(expected_payout#15295 as double)), sum(cast(bid_price#15275
as double)), sum(cast(noads#15303 as double)), sum(cast(e#15292 as double)), sum(cast(g#15299
as double)), sum(publisher_revenue#15306), ... 4 more fields]) 
>>   +- Exchange hashpartitioning(advertiser_id#15269, 3) 
>>      +- *HashAggregate(keys=[advertiser_id#15269], functions=[partial_sum(conversions#15282L),
partial_sum(cast(dmp_rapleaf_margin#15290 as double)), partial_sum(pvc#15307L), partial_sum(cast(dmp_nielsen_payout#15289
as double)), partial_sum(fraud_clicks#15297L), partial_sum(cast(impressions#15300 as double)),
partial_sum(cast(conv_prob#15280 as double)), partial_sum(cast(dmp_liveramp_payout#15287 as
double)), partial_sum(cast(decisions#15285 as double)), partial_sum(fraud_impressions#15298L),
partial_sum(advertiser_spent#15270), partial_sum(cast(actual_ssp_fee#15268 as double)), partial_sum(cast(dmp_nielsen_margin#15288
as double)), partial_sum(first_impressions#15296L), partial_sum(clicks#15278L), partial_sum(cast(second_price#15308
as double)), partial_sum(cast(click_prob#15277 as double)), partial_sum(clicks_static#15279L),
partial_sum(cast(expected_payout#15295 as double)), partial_sum(cast(bid_price#15275 as double)),
partial_sum(cast(noads#15303 as double)), partial_sum(cast(e#15292 as double)), partial_sum(cast(g#15299
as double)), partial_sum(publisher_revenue#15306), ... 4 more fields]) 
>>         +- *Project [actual_pgm_fee#15267, actual_ssp_fee#15268, advertiser_id#15269,
advertiser_spent#15270, bid_price#15275, click_prob#15277, clicks#15278L, clicks_static#15279L,
conv_prob#15280, conversions#15282L, dd_convs#15284L, decisions#15285, dmp_liveramp_margin#15286,
dmp_liveramp_payout#15287, dmp_nielsen_margin#15288, dmp_nielsen_payout#15289, dmp_rapleaf_margin#15290,
dmp_rapleaf_payout#15291, e#15292, expected_payout#15295, first_impressions#15296L, fraud_clicks#15297L,
fraud_impressions#15298L, g#15299, ... 5 more fields] 
>>            +- *BatchedScan parquet slicer.573_slicer_rnd_13[actual_pgm_fee#15267,actual_ssp_fee#15268,advertiser_id#15269,advertiser_spent#15270,bid_price#15275,click_prob#15277,clicks#15278L,clicks_static#15279L,conv_prob#15280,conversions#15282L,dd_convs#15284L,decisions#15285,dmp_liveramp_margin#15286,dmp_liveramp_payout#15287,dmp_nielsen_margin#15288,dmp_nielsen_payout#15289,dmp_rapleaf_margin#15290,dmp_rapleaf_payout#15291,e#15292,expected_payout#15295,first_impressions#15296L,fraud_clicks#15297L,fraud_impressions#15298L,g#15299,...
7 more fields] Format: ParquetFormat, InputPaths: hdfs:// spark-master1.uslicer.net:8020/user/hive/warehouse/slicer.db/573_slicer..
., PushedFilters: [], ReadSchema: struct<actual_pgm_fee:float,actual_ssp_fee:float,advertiser_id:int,advertiser_spent:double,bid_pr...

>>
>>
>>0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS `advertiser_id`,
SUM(`conversions`) AS `conversions`, SUM(`dmp_rapleaf_margin`) AS `dmp_rapleaf_margin`, SUM(`pvc`)
AS `pvc`, SUM(`dmp_nielsen_payout`) AS `dmp_nielsen_payout`,  SUM(`fraud_clicks`) AS `fraud_clicks`,
SUM(`impressions`) AS `impressions`,  SUM(`conv_prob`) AS `conv_prob`, SUM(`dmp_liveramp_payout`)
AS `dmp_liveramp_payout`,  SUM(`decisions`) AS `decisions`,  SUM(`fraud_impressions`) AS
`fraud_impressions`, SUM(`advertiser_spent`) AS `advertiser_spent`, SUM(`actual_ssp_fee`)
AS `actual_ssp_fee`,  SUM(`dmp_nielsen_margin`) AS `dmp_nielsen_margin`, SUM(`first_impressions`)
AS `first_impressions`, SUM(`clicks`) AS `clicks`, SUM(`second_price`) AS `second_price`,
SUM(`click_prob`) AS `click_prob`, SUM(`clicks_static`) AS `clicks_static`, SUM(`expected_payout`)
AS `expected_payout`, SUM(`bid_price`) AS `bid_price`, SUM(`noads`) AS `noads`, SUM(`e`) AS
`e`, SUM(`g`) AS `g`, SUM(`publisher_revenue`) AS `publisher_revenue`, SUM(`dmp_liveramp_margin`)
AS `dmp_liveramp_margin`, SUM(`actual_pgm_fee`) AS `actual_pgm_fee`, SUM(`dmp_rapleaf_payout`)
AS `dmp_rapleaf_payout`, SUM(`dd_convs`) AS `dd_convs`  FROM `slicer`.`573_slicer_rnd_13`
 WHERE dt = '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30; 
>>
>>(results for three runs) 
>>30 rows selected (2.158 seconds) 
>>30 rows selected (1.83 seconds) 
>>30 rows selected (1.979 seconds) 
>>
>>Sergei Romanov.
Sergei Romanov
Mime
View raw message