spark-dev mailing list archives

From Maxim Gekk <maxim.g...@databricks.com>
Subject Re: Random sampling in tests
Date Mon, 08 Oct 2018 15:36:36 GMT
Hi All,

I believe we should also take into account what we test. For example, I
don't think it makes sense to check all timezones for JSON/CSV
functions/datasources, because those timezones are just passed through to
external libraries. The same Spark code path is exercised for each of the
650 timezones, so we basically just spend time and resources testing the
external libraries.

I am referring to these PRs: https://github.com/apache/spark/pull/22657 and
https://github.com/apache/spark/pull/22379#discussion_r223039662

Maxim Gekk

Technical Solutions Lead

Databricks B. V.  <http://databricks.com/>


On Mon, Oct 8, 2018 at 4:49 PM Sean Owen <srowen@gmail.com> wrote:

> If the problem is simply reducing the wall-clock time of tests, then
> even before we get to this question, I'm advocating:
>
> 1) try simple parallelization of tests within the suite. In this
> instance there's no reason not to test these in parallel and get an 8x
> or 16x speedup from cores. This assumes, I believe correctly, that the
> machines aren't generally near 100% utilization
> 2) explicitly choose a smaller, more directed set of cases to test
>
> Randomly choosing test cases with a fixed seed is basically 2, but
> without choosing test cases for a particular reason. You can vary the
> seed, but with a fixed seed the same random subset of tests is always
> chosen. That could be fine if there's no reason at all to prefer some
> cases over others. But I am guessing that even a rough guess at the
> most important subset of cases to test is better than a random one.
>
> I'm trying 1) right now instead in these several cases.
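[Editor's note: Sean's option 1, parallelizing independent cases within one suite, can be sketched as below. This is an illustrative Python sketch, not Spark's Scala/ScalaTest code; `check_timezone` and `run_all` are hypothetical stand-ins for a suite's per-case body and driver loop.]

```python
# Sketch of option 1: fan independent per-timezone cases out over a
# worker pool instead of running them sequentially. With N cores and
# independent cases this approaches an N-x wall-clock speedup.
from concurrent.futures import ThreadPoolExecutor

def check_timezone(tz):
    # Hypothetical stand-in for the real per-timezone assertion body.
    return tz, True

def run_all(timezones, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(check_timezone, timezones))
    failures = [tz for tz, ok in results.items() if not ok]
    assert not failures, f"failed timezones: {failures}"
    return results

results = run_all(["UTC", "PST", "CET", "JST"])
```

This only helps when the cases share no mutable state; a suite whose cases touch a common SparkSession or global config would need per-case isolation first.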
> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li <lixiao@databricks.com> wrote:
> >
> > For this specific case, I do not think we should test all the timezones.
> If this is fast, I am fine with leaving it unchanged. However, this is very
> slow. Thus, I would even prefer reducing the tested timezones to a smaller
> number, or just hardcoding some specific time zones.
> >
> > In general, I like Reynold’s idea of including the seed value and adding
> the seed to the test case name. This can help us reproduce failures.
> >
> > Xiao
> >
> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <rxin@databricks.com> wrote:
> >>
> >> I'm personally not a big fan of doing it that way in the PR. It is
> perfectly fine to employ randomized tests, and in this case it might even
> be fine to just pick a couple of different timezones like the way it
> happened in the PR, but we should:
> >>
> >> 1. Document in the code comment why we did it that way.
> >>
> >> 2. Use a seed and log the seed, so any test failures can be reproduced
> deterministically. For this one, it'd be better to pick the seed from a
> seed environment variable. If the env variable is not set, fall back to a
> random seed.
> >>
> >>
> >>
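[Editor's note: Reynold's seed-from-environment suggestion can be sketched as below. This is an illustrative Python sketch, not Spark's Scala test harness; the variable name `SPARK_TEST_SEED` and the helpers `pick_seed`/`sample_cases` are hypothetical.]

```python
# Sketch of point 2: take the sampling seed from an environment
# variable when set, otherwise pick a random one, and log it so a
# failing run can be replayed deterministically.
import os
import random

def pick_seed(env_var="SPARK_TEST_SEED"):
    value = os.environ.get(env_var)
    seed = int(value) if value is not None else random.randrange(2**31)
    # Log the seed (here: print) so the exact subset can be reproduced
    # by re-running with SPARK_TEST_SEED=<seed>.
    print(f"test seed = {seed}")
    return seed

def sample_cases(cases, k, seed):
    # A fixed seed always yields the same subset, so a failure is
    # reproducible from the logged seed alone.
    return random.Random(seed).sample(cases, k)

all_zones = sorted(["UTC", "PST", "CET", "JST", "EST"])
subset = sample_cases(all_zones, 2, pick_seed())
```

Embedding the seed in the test case name, as Xiao suggests, then makes the reproduction recipe visible directly in the CI failure report.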
> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <srowen@gmail.com> wrote:
> >>>
> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
> >>> that tests a bunch of cases by randomly choosing different subsets of
> >>> cases to test on each Jenkins run.
> >>>
> >>> There's disagreement about whether this is a good approach to improving
> >>> test runtime. Here's a discussion on one that was committed:
> >>> https://github.com/apache/spark/pull/22631/files#r223190476
> >>>
> >>> I'm flagging it for more input.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
