spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
Date Fri, 07 Oct 2016 08:00:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554459#comment-15554459
] 

Reynold Xin edited comment on SPARK-17626 at 10/7/16 7:59 AM:
--------------------------------------------------------------

[~ioana-delaney] Thanks for commenting.

1. I think a specialized join operator makes sense for Volcano-style (or column-batch) operators,
but I am not sure it is still necessary with whole-stage code generation. For example,
a special star join operator would just look like a sequence of broadcast joins with whole-stage
code generation (since the rows are never materialized -- although I do recognize there
are things that can be done to improve the generated plan even more, e.g. by injecting
a bloom filter to make the plan robust against bad join orderings).
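
To make the bloom filter idea concrete, here is a minimal sketch (illustrative only, with a
toy bloom filter; not Spark's actual implementation): the dimension table's join keys populate
a bloom filter, and fact rows are probed against it before the join, so most non-matching rows
are pruned cheaply even when the join ordering is poor.

```python
# Illustrative sketch only -- not Spark's implementation. A toy bloom
# filter is built from the dimension join keys; fact rows are probed
# against it before the join so non-matching rows are pruned early.

class SimpleBloom:
    """Toy bloom filter; real ones tune the bit-array size and hash count."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.array = 0  # bit array packed into one integer

    def _hashes(self, key):
        # Two cheap hash positions per key.
        return (hash(key) % self.bits, hash((key, 17)) % self.bits)

    def add(self, key):
        for h in self._hashes(key):
            self.array |= 1 << h

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.array & (1 << h) for h in self._hashes(key))


def bloom_filtered_join(fact_rows, dim_rows):
    """Join fact (key, payload) rows to dim {key: payload}, pre-filtering
    the fact side with a bloom filter built from the dimension keys."""
    bloom = SimpleBloom()
    for key in dim_rows:
        bloom.add(key)
    # Cheap bloom probe prunes most non-matching fact rows up front;
    # the actual join still checks the dimension table for correctness.
    pruned = [(k, v) for k, v in fact_rows if bloom.might_contain(k)]
    return [(v, dim_rows[k]) for k, v in pruned if k in dim_rows]
```

Because the probe can only produce false positives, the final lookup keeps the join correct;
the bloom filter only reduces how many fact rows reach it.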

2. Can you comment, from first principles, on why this is still useful with a cost-based
optimizer? So far your argument is mostly that many RDBMSes have both and that they are
complementary. It would be great to explain why they are complementary.


was (Author: rxin):
[~ioana-delaney] Thanks for commenting.

1. I think a specialized join operator makes sense for Volcano-style (or column-batch) operators,
but I am not sure it is still necessary with whole-stage code generation. For example,
a special star join operator would just look like a sequence of broadcast joins with whole-stage
code generation (since the rows are never materialized).

2. Can you comment, from first principles, on why this is still useful with a cost-based
optimizer? So far your argument is mostly that many RDBMSes have both and that they are
complementary. It would be great to explain why they are complementary.

> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Priority: Critical
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas with dimensions
linking to other dimensions. A star schema consists of a fact table referencing a number of dimension
tables. The fact table holds the main data about a business. A dimension table, usually a smaller
table, describes data reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of sub-optimal
execution plans for joins of large fact tables. Manually rewriting some of the queries into selective
fact-dimension joins resulted in significant performance improvements. This prompted us to
develop a simple join reordering algorithm based on star schema detection. Performance
testing using the *1TB TPC-DS workload* shows an overall improvement of *19%*.
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%	
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # A new selectivity hint to specify the selectivity of predicates over base tables.
The selectivity hint is optional and was not used in the above TPC-DS tests.
> \\
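
As an aside, the star schema detection and join reordering in item 1 above might be sketched
roughly as follows. This is a hedged illustration of the general idea only -- the size threshold
and ordering rule here are invented for the example and are not the algorithm from the attached
document: treat the largest table in the join as the fact table, classify much-smaller joined
tables as dimensions, and join the smallest (plausibly most selective) dimensions first.

```python
# Illustrative heuristic only; the 10% size ratio and smallest-first
# ordering are assumptions for this sketch, not Spark's or the
# proposal's actual rules.

def star_join_order(table_rows, dim_size_ratio=0.1):
    """table_rows: {table_name: row_count} for the tables in a join.
    Returns a join order starting with the detected fact table."""
    # The largest table is assumed to be the fact table.
    fact = max(table_rows, key=table_rows.get)
    # Tables far smaller than the fact table are treated as dimensions,
    # joined smallest-first so selective joins shrink the fact side early.
    dims = sorted(
        (t for t in table_rows
         if t != fact and table_rows[t] <= table_rows[fact] * dim_size_ratio),
        key=table_rows.get,
    )
    return [fact] + dims
```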



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

