flink-issues mailing list archives

From "Jingsong Lee (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-15498) Using HiveCatalog in TPC-DS e2e
Date Tue, 07 Jan 2020 08:54:00 GMT

    https://issues.apache.org/jira/browse/FLINK-15498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009520#comment-17009520

Jingsong Lee commented on FLINK-15498:

Hi [~ykt836]

This ticket aims to make TPC-DS testing more productive.

Users who want to try Flink batch typically want to reproduce its benchmark scores. They
face the following difficulties:
 # Generating the TPC-DS data.
 # Preparing the tables: creating tables, converting the CSV data to ORC format, analyzing tables.
 # Executing the select queries in Flink batch.
 # Executing the same queries in Hive/Tez/Spark/Presto.

For #1, we can only provide limited help; users are better served by the official
TPC-DS documentation.

But for #2, we can provide the Hive preparation step, so users can easily reproduce it on
their own clusters with our e2e code. These tables can then be read from other systems too.
As far as I know, this step is troublesome: it involves creating Hive tables with the right
nullability, primary keys, ORC compression, and column types; loading the original TPC-DS
data into ORC tables; and analyzing the tables.

And for #3, the e2e test should be exactly the same as the benchmark.
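On the Flink side, #3 then reduces to registering the HiveCatalog and running the same query text as the other engines. A minimal sketch, assuming Flink's `CREATE CATALOG` DDL for the Hive connector; the catalog name, configuration path, and query are illustrative:

```sql
-- Illustrative sketch: run a TPC-DS-style query in Flink batch against the
-- Hive-prepared tables. Catalog name and hive-conf-dir are assumptions.
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/etc/hive/conf'
);
USE CATALOG myhive;

-- The query text is identical to the one run in Hive/Tez/Spark/Presto,
-- so the result is directly comparable to the benchmark numbers.
SELECT ss_item_sk, SUM(ss_sales_price)
FROM store_sales
GROUP BY ss_item_sk;
```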

> Using HiveCatalog in TPC-DS e2e
> -------------------------------
>                 Key: FLINK-15498
>                 URL: https://issues.apache.org/jira/browse/FLINK-15498
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner, Tests
>            Reporter: Jingsong Lee
>            Priority: Major
>             Fix For: 1.11.0
> In 1.10, we have made great progress in the performance and functionality of batch. In
our internal tests, the performance is significantly ahead of Hive.
> But it's hard for users to reproduce: they need to do some research on TPC-DS to write
the test code.
> We can consider switching the TPC-DS e2e test to HiveCatalog, roughly divided into two
stages:
>  # The first stage is the Hive preparation: create the TPC-DS tables, insert the data,
prepare the metastore, and analyze the tables.
>  # The second stage is the Flink analysis: only run the select queries and check the results.
> Users can play with it just by changing the data scale in the first stage.

This message was sent by Atlassian Jira
