hive-issues mailing list archives

From "Alan Gates (JIRA)" <>
Subject [jira] [Commented] (HIVE-12316) Improved integration test for Hive
Date Wed, 04 Nov 2015 20:58:27 GMT


Alan Gates commented on HIVE-12316:

Sure.  For an idea of what the tests look like, take a look at the examples package under test.
 This contains examples for both queries and explain.  The qconverted package has a direct
translation of skewjoin_union_remove_[12].q.

In terms of similarities with qfiles, this seeks to support tests for explain and SQL queries,
and to allow developers to set config values, etc.

For differences, there are several that are key:
* A single test can be written and used to test various combinations of Hive features (security
on/off, access via cli/jdbc, metastore in rdbms/hbase, etc.).
* The tests can be run on a user's laptop with a few kilobytes of data or a cluster with up
to terabytes of data.  Scaling and expected results generation are handled.
* This runs all in JUnit and Java.  There's no need for a separate infrastructure (QTestUtils
et al).
* No more golden files.  For most queries expected results can be generated by the framework.
 For explain the plan can be accessed programmatically rather than relying on string comparison.
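
To illustrate the last point, here is a minimal sketch of checking a plan through its structure
instead of comparing explain text.  It is not the API in the attached patch: PlanNode,
PlanCheckSketch and the operator names are made up for illustration, and plain Java is used in
place of JUnit so the example stands alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a query plan node; not Hive's actual plan classes.
final class PlanNode {
    final String operator;
    final List<PlanNode> children = new ArrayList<>();

    PlanNode(String operator, PlanNode... kids) {
        this.operator = operator;
        for (PlanNode k : kids) {
            children.add(k);
        }
    }

    // Count operators of the given kind anywhere in this plan tree.
    int count(String kind) {
        int n = operator.equals(kind) ? 1 : 0;
        for (PlanNode c : children) {
            n += c.count(kind);
        }
        return n;
    }
}

final class PlanCheckSketch {
    // Assumed shape of a plan for a two-table join, built by hand here.
    static PlanNode samplePlan() {
        return new PlanNode("FileSink",
            new PlanNode("Select",
                new PlanNode("MapJoin",
                    new PlanNode("TableScan"),
                    new PlanNode("TableScan"))));
    }

    public static void main(String[] args) {
        PlanNode plan = samplePlan();
        // Assert only on the property under test, not on the rendered text.
        if (plan.count("MapJoin") != 1) {
            throw new AssertionError("expected exactly one map join");
        }
        if (plan.count("TableScan") != 2) {
            throw new AssertionError("expected two table scans");
        }
        System.out.println("plan checks passed");
    }
}
```

A check like this keeps passing when an unrelated change alters how the rest of the plan is
printed, which is the failure mode of golden files for explain output.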

Will we need to translate qfiles?  Maybe.  In the long run it won't make sense to have both
this and the qfile infrastructure, as this aims to do everything that qfiles can do and much
more.  But given that this is brand new it has to mature quite a bit and the community has
to adopt it.  It's not time to start a wholesale translation.  I've looked at building a
qfile translator but I'm not convinced it's a great idea.  One, because I think this will
allow us to build fewer overall tests than we have qfile tests.  Two, qfile tests mix explain
(which are really tests for the optimizer) and queries.  I'm wondering if it doesn't make
sense to split these out.

This is all in JUnit, so adding new tests is straightforward.  One important thing is that
there's a fair amount of setup and teardown for each test file, so it's good to combine like
tests into a single class (e.g. one might put all of the ACID insert/update/delete tests together)
rather than have a single test per class.
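
As a rough sketch of why grouping helps, the example below pays for one expensive setup and
shares it across three related tests.  SharedFixture and AcidTestsSketch are hypothetical names,
and the in-memory map merely stands in for real tables.  In JUnit 4 the setup and teardown
methods would carry @BeforeClass and @AfterClass; plain Java is used here so the example stands
alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical expensive fixture, standing in for cluster and table setup.
final class SharedFixture {
    static int setups = 0;                       // how many times setup was paid for
    final Map<Integer, String> table = new HashMap<>();

    SharedFixture() {
        setups++;
        table.put(1, "a");
        table.put(2, "b");
    }
}

final class AcidTestsSketch {
    private static SharedFixture fixture;

    // Would be annotated @BeforeClass in JUnit 4.
    static void setUpClass() { fixture = new SharedFixture(); }

    // Would be annotated @AfterClass in JUnit 4.
    static void tearDownClass() { fixture = null; }

    static void check(boolean ok, String msg) {
        if (!ok) throw new AssertionError(msg);
    }

    // Each test touches its own row, so the tests do not depend on run order.
    static void testInsert() {
        fixture.table.put(3, "c");
        check("c".equals(fixture.table.get(3)), "insert failed");
    }

    static void testUpdate() {
        fixture.table.put(1, "a2");
        check("a2".equals(fixture.table.get(1)), "update failed");
    }

    static void testDelete() {
        fixture.table.remove(2);
        check(!fixture.table.containsKey(2), "delete failed");
    }

    // Runs all three tests against one shared fixture; returns the number of tests run.
    static int runAll() {
        setUpClass();
        try {
            testInsert();
            testUpdate();
            testDelete();
            return 3;
        } finally {
            tearDownClass();
        }
    }

    public static void main(String[] args) {
        int ran = runAll();
        System.out.println(ran + " tests, " + SharedFixture.setups + " setup(s)");
    }
}
```

With one test per class the same three tests would pay for setup three times.  The trade-off is
that tests sharing a fixture must avoid stepping on each other's data, which is why each one
above works on a separate row.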

> Improved integration test for Hive
> ----------------------------------
>                 Key: HIVE-12316
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Testing Infrastructure
>    Affects Versions: 2.0.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: HIVE-12316.patch
> In working with Hive testing I have found there are several issues that are causing problems
for developers, testers, and users:
> * Because Hive has many tunable knobs (file format, security, etc.) we end up with tests
that cover the same functionality with different permutations of these features.
> * The Hive integration tests (i.e. qfiles) cannot be run on a cluster.  This means we cannot
run any of those tests at scale.  The HBase community by contrast uses the same test suite
locally and on a cluster, and has found that this helps them greatly in testing.
> * Golden files are a grievous evil.  Test writers are forced to eyeball results the first
time they run a test and decide whether they look reasonable, which is error prone and makes
testing at scale impossible.  And changes to one part of Hive often end up changing the plan
(and the output of explain) thus breaking many tests that are not related.  This is particularly
an issue for people working on the optimizer.  
> * The lack of ability to run on a cluster means that when people test Hive at scale,
they are forced to develop custom frameworks which can't then benefit the community.
> * There is no easy mechanism to bring user queries into the test suite.
> I propose we build a new testing capability with the following requirements:
> * One test should be able to run all reasonable permutations (mr/tez/spark, orc/parquet/text/rcfile,
secure/non-secure etc.)  This doesn't mean it would run every permutation every time, but
that the tester could choose which permutation to run.
> * The same tests should run locally and on a cluster.  The tests should support scaling
of input data from Ks to Ts.
> * Expected results should be auto-generated whenever possible, and this should work with
the scaling of inputs.  The dev should be able to provide expected results or custom expected
result generation in cases where auto-generation doesn't make sense.
> * Access to the query plan should be available as an API in the tests so that golden
files of explain output are not required.
> * This should run in maven, junit, and java so that developers do not need to manage
yet another framework.
> * It should be possible to simulate user data (based on schema and statistics) and incorporate
user queries, so that tests from user scenarios can be added quickly.

