spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: DSv2 reader lifecycle
Date Wed, 06 Nov 2019 21:42:59 GMT
Hi Andrew,

This is expected behavior for DSv2 in 2.4. A separate reader is configured
for each operation because the configuration will change. A count, for
example, doesn't need to project any columns, but a count distinct will.
Similarly, if your read has different filters we need to apply those to a
separate reader for each scan.

The newer API that we are releasing in Spark 3.0 addresses the concern that
each reader is independent by using Catalog and Table interfaces. In the
new version, Spark will load a table by name from a persistent catalog
(loaded once) and use the table to create a reader for each operation. That
way, you can load common metadata in the table, cache the table, and pass
its info to readers as they are created.

rb

On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo <andrew.melo@gmail.com> wrote:

> Hello,
>
> During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that
> our DataSourceReader is being instantiated multiple times for the same
> dataframe. For example, the following snippet
>
>         Dataset<Row> df = spark
>                 .read()
>                 .format("edu.vanderbilt.accre.laurelin.Root")
>                 .option("tree",  "Events")
>                 .load("testdata/pristine/2018nanoaod1june2019.root");
>
> Constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
> createReader once (as an aside, this seems like a lot for 1000 columns?
> "CodeGenerator: Code generated in 8162.847517 ms")
>
> but then running operations on that dataframe (e.g. df.count()) calls
> createReader for each call, instead of holding the existing
> DataSourceReader.
>
> Is that the expected behavior? Because of the file format, it's quite
> expensive to deserialize all the various metadata, so I was holding the
> deserialized version in the DataSourceReader, but if Spark is repeatedly
> constructing new ones, then that doesn't help. If this is the expected
> behavior, how should I handle this as a consumer of the API?
>
> Thanks!
> Andrew
>


-- 
Ryan Blue
Software Engineer
Netflix

Mime
View raw message