spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyukjin Kwon <gurwls...@gmail.com>
Subject [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)
Date Thu, 16 Jan 2020 10:55:54 GMT
Hi all,

I would like to suggest to take one step back at
https://github.com/apache/spark/pull/24117 and rethink about it.
I am writing this email as I raised the issue few times but could not have
enough responses promptly, and
the code freeze is being close.

In particular, please refer the below comments for the full context:
- https://github.com/apache/spark/pull/24117#issuecomment-568891483
- https://github.com/apache/spark/pull/24117#issuecomment-568614961
- https://github.com/apache/spark/pull/24117#issuecomment-568891483


In short, this PR added an API in DSv2:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*


So people can write some classes for *transform(col)* for partitioned
column specifically.

However, there are some design concerns which looked not addressed properly.

Note that one of the main point is to avoid half-baked or just-work-for-now
APIs. However, this looks
definitely like half-completed. Therefore, I would like to propose to take
one step back and revert it for now.
Please see below the concerns listed.

*Duplication of existing expressions*
Seems like existing expressions are going to be duplicated. See below new
APIs added:

def years(column: String): YearsTransform = YearsTransform(reference(column))
def months(column: String): MonthsTransform = MonthsTransform(reference(column))
def days(column: String): DaysTransform = DaysTransform(reference(column))
def hours(column: String): HoursTransform = HoursTransform(reference(column))
...

It looks like it requires to add a copy of our existing expressions, in the
future.


*Limited Extensibility*
It has a clear limitation. It looks other expressions are going to be
allowed together (e.g., `concat(years(col) + days(col))`);
however, it looks impossible to extend with the current design. It just
directly maps transformName to implementation class,
and just pass arguments:

transform
    ...
    | transformName=identifier
      '(' argument+=transformArgument (','
argument+=transformArgument)* ')'  #applyTransform
    ;

It looks regular expressions are supported; however, it's not.
- If we should support, the design had to consider that.
- if we should not support, different syntax might have to be used instead.

*Limited Compatibility Management*
The name can be arbitrary. For instance, if "transform" is supported in
Spark side, the name is preempted by Spark.
If every the datasource supported such name, it becomes not compatible.

Mime
View raw message