spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Melo <>
Subject Re: DSv2 reader lifecycle
Date Thu, 07 Nov 2019 01:48:21 GMT
Hi Ryan,

Thanks for the pointers

On Thu, Nov 7, 2019 at 8:13 AM Ryan Blue <> wrote:

> Hi Andrew,
> This is expected behavior for DSv2 in 2.4. A separate reader is configured
> for each operation because the configuration will change. A count, for
> example, doesn't need to project any columns, but a count distinct will.
> Similarly, if your read has different filters we need to apply those to a
> separate reader for each scan.

Ah, I presumed that the interaction was slightly different, there was a
single reader configured and (e.g.) pruneSchema was called repeatedly to
change the desired output schema. I guess for 2.4 it's best for me to
cache/memoize the metadata for paths/files to keep them from being
repeatedly calculated.

> The newer API that we are releasing in Spark 3.0 addresses the concern
> that each reader is independent by using Catalog and Table interfaces. In
> the new version, Spark will load a table by name from a persistent catalog
> (loaded once) and use the table to create a reader for each operation. That
> way, you can load common metadata in the table, cache the table, and pass
> its info to readers as they are created.

That's good to know, I'll search around JIRA for docs describing that

Thanks again,

> rb
> On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo <> wrote:
>> Hello,
>> During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears
>> that our DataSourceReader is being instantiated multiple times for the same
>> dataframe. For example, the following snippet
>>         Dataset<Row> df = spark
>>                 .read()
>>                 .format("edu.vanderbilt.accre.laurelin.Root")
>>                 .option("tree",  "Events")
>>                 .load("testdata/pristine/2018nanoaod1june2019.root");
>> Constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
>> createReader once (as an aside, this seems like a lot for 1000 columns?
>> "CodeGenerator: Code generated in 8162.847517 ms")
>> but then running operations on that dataframe (e.g. df.count()) calls
>> createReader for each call, instead of holding the existing
>> DataSourceReader.
>> Is that the expected behavior? Because of the file format, it's quite
>> expensive to deserialize all the various metadata, so I was holding the
>> deserialized version in the DataSourceReader, but if Spark is repeatedly
>> constructing new ones, then that doesn't help. If this is the expected
>> behavior, how should I handle this as a consumer of the API?
>> Thanks!
>> Andrew
> --
> Ryan Blue
> Software Engineer
> Netflix

View raw message