spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Re: Plans for improved Spark DataFrame/Dataset unit testing?
Date Sun, 21 Aug 2016 16:30:44 GMT
On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr <spark@bedryt.ski>
wrote:

> Hi,
>
> we share the same spark/hive context between tests (executed in
> parallel), so the main problem is that temporary tables are overwritten
> each time they are created. This can create race conditions, as these
> tempTables are effectively global mutable shared state.
>
> So each time we create a temporary table, we append a unique,
> incremented, thread-safe id (from an AtomicInteger) to its name, so
> that each test uses only its own non-shared temporary tables.
>

Makes sense.
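For what it's worth, that scheme might look roughly like this (a minimal sketch; the class and method names here are hypothetical, not from your code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: per-test unique temp-table names backed by an AtomicInteger,
// so parallel tests never register the same temporary table name twice.
class TempTableNames {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Appends a unique, thread-safe id to the base name.
    static String uniqueName(String base) {
        return base + "_" + COUNTER.incrementAndGet();
    }
}
```

Each test would then register its table under e.g. `TempTableNames.uniqueName("users")` instead of a fixed name.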

But when you say you're sharing the same spark/hive context between
tests, I'm assuming that's between tests within one test class, but
you're not sharing across test classes (which a build tool like Maven or
Gradle might execute in separate JVMs).

Is that right?
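In the Maven case, my understanding is that whether test classes even share a JVM is governed by the Surefire fork settings. A setup like this (a hedged example of a typical configuration, not taken from anyone's actual build) runs all test classes sequentially in one reused forked JVM, so anything held per-JVM could in principle span classes:

```xml
<!-- pom.xml (sketch): one forked JVM, reused across all test classes -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

With `reuseForks` set to `false` (or `forkCount` greater than 1), each class can get a fresh JVM and nothing would be shared across classes.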




>
> --
>   Bedrytski Aliaksandr
>   spark@bedryt.ski
>
>
>
> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just following up on this --
>
> When people talk about a shared session/context for testing like this,
> I assume it's still within one test class. So it's still the case that
> if you have a lot of test classes that test Spark-related things, you
> must configure your build system not to run them in parallel.
> You'll get the benefit of not creating and tearing down a Spark
> session/context between test cases within a test class, though.
>
> Is that right?
>
> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
> share Spark sessions/contexts across integration tests in a safe way?
>
>
> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
> <holden@pigscanfly.ca> wrote:
> That's a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the
> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
> I'll try and include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
> <koert@tresata.com> wrote:
> we share a single SparkSession across tests, and they can run
> in parallel. It's pretty fast.
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
> <everett@nuna.com.invalid> wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup
> that brings up a local master as in this article[1].
>
> That's a lot of overhead for unit testing and the tests can't run
> in parallel, so testing is slow -- this is more like what I'd call
> an integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks
> on fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit testing
> possible with Spark SQL, perhaps with a stripped-down in-memory
> implementation? (I admit this does seem quite hard since there's so
> much functionality in these classes!)
>
> Thanks!
>
>
> - Everett
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>
