spark-dev mailing list archives

From Russell Spitzer <>
Subject Re: [DISCUSS] SPIP: APIs for Table Metadata Operations
Date Wed, 05 Sep 2018 04:18:01 GMT
They are based on a physical column; the column is real. The function just
exists only in the datasource.

For example

SELECT ttl(a), ttl(b) FROM table

On Tue, Sep 4, 2018 at 11:16 PM Reynold Xin <> wrote:

> Russell, your special columns wouldn’t actually work with option 1, because
> Spark would have to fail them in analysis without an actual physical
> column.
> On Tue, Sep 4, 2018 at 9:12 PM Russell Spitzer <>
> wrote:
>> I'm a big fan of 1 as well. I had to implement something similar using
>> custom expressions, and it was a bit more work than it should have been. In
>> particular, our use case is that columns have certain metadata (ttl,
>> writetime) which exist not as separate columns but as special values that
>> can be surfaced.
>> I still don't have a good solution for the same thing at write time,
>> though, since the problem is a bit asymmetric for us. While you can read
>> metadata from any particular cell, on write you specify it for the whole
>> row.
>> On Tue, Sep 4, 2018 at 11:04 PM Ryan Blue <>
>> wrote:
>>> Thanks for posting the summary. I'm strongly in favor of option 1.
>>> I think the API footprint is fairly small, but worth it. Not only does
>>> it make sources easier to implement by handling parsing, it also makes
>>> sources more reliable because Spark handles validation the same way across
>>> sources.
>>> A good example is making sure that the referenced columns exist in the
>>> table, which should be done using the case sensitivity of the analyzer.
>>> Spark would pass normalized column names that match the case of the
>>> declared columns to ensure that there isn't a problem if Spark is case
>>> insensitive but the source doesn't implement it. And the source wouldn't
>>> have to know about Spark's case sensitivity settings at all.
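The normalization Ryan describes can be made concrete with a small sketch. This is not the actual analyzer code; the helper name and signature are illustrative only:

```scala
// Hypothetical helper: resolve a user-supplied column reference against
// the table's declared column names, honoring a case sensitivity flag.
// On a match, return the declared (normalized) spelling, which is what
// Spark would then pass to the source.
def normalize(ref: String, declared: Seq[String], caseSensitive: Boolean): Option[String] =
  if (caseSensitive) declared.find(_ == ref)
  else declared.find(_.equalsIgnoreCase(ref))

val declared = Seq("eventTime", "value")
normalize("EVENTTIME", declared, caseSensitive = false)  // Some("eventTime")
normalize("EVENTTIME", declared, caseSensitive = true)   // None
```

Because the source only ever sees the declared spelling, it never needs to know what Spark's case sensitivity setting was.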
>>> On Tue, Sep 4, 2018 at 7:46 PM Reynold Xin <> wrote:
>>>> Ryan, Michael and I discussed this offline today. Some notes here:
>>>> His use case is to support partitioning data by derived columns, rather
>>>> than physical columns, because he didn't want his users to keep adding a
>>>> "date" column when in reality it is purely derived from some timestamp
>>>> column. We reached consensus that this is a great use case and something
>>>> we should support.
>>>> We are still debating how to do this at API level. Two options:
>>>> *Option 1.* Create a smaller-surface, parallel Expression library,
>>>> and use that for specifying partition columns. The bare minimum class
>>>> hierarchy would look like:
>>>> trait Expression
>>>> class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>>>> class Literal(value: Any) extends Expression
>>>> class ColumnReference(name: String) extends Expression
>>>> These classes don't define how the expressions are evaluated, and it'd
>>>> be up to the data sources to interpret them. As an example, for a table
>>>> partitioned by date(ts), Spark would pass the following to the underlying
>>>> ds:
>>>> NamedFunction("date", ColumnReference("timestamp") :: Nil)
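As a rough illustration of Option 1 (using case classes so a source can pattern-match; the interpretation below is hypothetical, source-side logic, not anything Spark itself would do):

```scala
// The parallel expression classes sketched above, as case classes.
sealed trait Expression
case class NamedFunction(name: String, args: Seq[Expression]) extends Expression
case class Literal(value: Any) extends Expression
case class ColumnReference(name: String) extends Expression

// One possible source-side interpretation: render the tree back into a
// readable form. A real source would instead map it to its own
// partitioning scheme.
def render(e: Expression): String = e match {
  case NamedFunction(name, args) => s"$name(${args.map(render).mkString(", ")})"
  case ColumnReference(name)     => name
  case Literal(value)            => value.toString
}

render(NamedFunction("date", ColumnReference("timestamp") :: Nil))  // "date(timestamp)"
```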
>>>> *Option 2.* Spark passes strings over to the data sources. For the
>>>> above example, Spark simply passes "date(ts)" as a string over.
>>>> The pros/cons of 1 vs 2 are basically the inverse of each other. Option
>>>> 1 creates more rigid structure, with extra complexity in API design. Option
>>>> 2 is less structured but more flexible. Option 1 gives Spark the
>>>> opportunity to enforce that column references are valid (but not the
>>>> actual function names), whereas option 2 would leave validation up to
>>>> the data sources.
>>>> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <> wrote:
>>>>> I think I found a good solution to the problem of using Expression in
>>>>> the TableCatalog API and in the DeleteSupport API.
>>>>> For DeleteSupport, there is already a stable and public subset of
>>>>> Expression named Filter that can be used to pass filters. The reason
>>>>> DeleteSupport would use Expression is to support more complex expressions,
>>>>> like to_date(ts) = '2018-08-15' being translated to ts >=
>>>>> 1534316400000000 AND ts < 1534402800000000. But this can be done by
>>>>> Spark instead of the data sources, so I think DeleteSupport should use
>>>>> Filter instead. I updated the DeleteSupport PR #21308
>>>>> <> with these changes.
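The rewrite Ryan describes is plain date arithmetic; a sketch of where those two microsecond bounds come from (assuming ts is microseconds since the epoch, and a Pacific-time session zone, which is an assumption inferred from the quoted numbers):

```scala
import java.time.{LocalDate, ZoneId}

// Expand an equality on to_date(ts) into a half-open timestamp range,
// mirroring the to_date(ts) = '2018-08-15' example in the thread.
def dateEqualityToMicrosRange(date: LocalDate, zone: ZoneId): (Long, Long) = {
  def micros(i: java.time.Instant): Long = i.getEpochSecond * 1000000L
  val lo = micros(date.atStartOfDay(zone).toInstant)
  val hi = micros(date.plusDays(1).atStartOfDay(zone).toInstant)
  (lo, hi)
}

val (lo, hi) =
  dateEqualityToMicrosRange(LocalDate.parse("2018-08-15"), ZoneId.of("America/Los_Angeles"))
// ts >= 1534316400000000 AND ts < 1534402800000000, as in the thread
```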
>>>>> Also, I agree that the DataSourceV2 API should also not expose
>>>>> Expression, so I opened SPARK-25127 to track removing
>>>>> SupportsPushDownCatalystFilter
>>>>> <>.
>>>>> For TableCatalog, I took a similar approach instead of introducing a
>>>>> parallel Expression API. Instead, I created a PartitionTransform API (like
>>>>> Filter) that communicates the transformation function, function parameters
>>>>> like num buckets, and column references. I updated the TableCatalog
>>>>> PR #21306 <> to use
>>>>> PartitionTransform instead of Expression, and I updated the text of the
>>>>> doc <>.
>>>>> I also raised a concern about needing to wait for Spark to add support
>>>>> for new expressions (now partition transforms). To get around this, I added
>>>>> an apply transform that passes the name of a function and an input
>>>>> column. That way, users can still pass transforms that Spark doesn't know
>>>>> about by name to data sources: apply("source_function", "colName").
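A rough sketch of what such an escape hatch could look like. The class names here are illustrative, not the actual API in the PR:

```scala
// Hypothetical shape of the PartitionTransform idea from the thread.
sealed trait PartitionTransform
case class Identity(column: String) extends PartitionTransform
case class Bucket(numBuckets: Int, column: String) extends PartitionTransform
// The escape hatch: a transform Spark doesn't know, passed through by name.
case class Apply(function: String, column: String) extends PartitionTransform

val transform = Apply("source_function", "colName")

// A source that understands "source_function" matches on the name;
// anything else it can reject.
val description = transform match {
  case Apply("source_function", col) => s"source-defined transform on $col"
  case other                         => s"unsupported: $other"
}
// description == "source-defined transform on colName"
```

The point is that new source-specific transforms need no Spark release: only the name travels through the API.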
>>>>> Please have a look at the updated pull requests and SPIP doc and
>>>>> comment!
>>>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
