spark-dev mailing list archives

From 大啊 <belie...@163.com>
Subject Re:Re: [Spark SQL]: SQL, Python, Scala and R API Consistency
Date Wed, 03 Feb 2021 09:48:08 GMT
+1 for this work!
I agree, particularly with this clarification below:
SQL is a somewhat different case. There are functions that aren't _that_ useful in general,
kind of niche, but nevertheless exist in other SQL systems, most notably Hive. It's useful
to try to expand SQL support to cover those to ease migration and interoperability. But it
may not make enough sense to maintain those functions in Scala, and Python, and R, because
they're niche.


At 2021-01-29 04:40:07, "Sean Owen" <srowen@gmail.com> wrote:

I think I can articulate the general idea here, though I expect it is not deployed consistently.


Yes, there's a general desire to make APIs consistent across languages. Python and Scala should
track each other pretty closely, even if R isn't quite as consistent.


SQL is a somewhat different case. There are functions that aren't _that_ useful in general,
kind of niche, but nevertheless exist in other SQL systems, most notably Hive. It's useful
to try to expand SQL support to cover those to ease migration and interoperability. But it
may not make enough sense to maintain those functions in Scala, and Python, and R, because
they're niche.


I think that was what you saw with regexp_extract_all. As you can see there isn't perfect
agreement on where to draw those lines. But I think the theory has been mostly consistent
over time, if not the execution.


It isn't that regexp_extract_all (for example) is useless outside SQL; it's a question of where
you draw the line. Supporting tens of niche SQL functions across three other languages has a cost,
which has to be weighed against a benefit we can never measure well except anecdotally:
one or two people say "I want this" in a sea of hundreds of thousands of users.


(I'm not sure about CalendarType - just I know that date/time types are hard even within,
say, the JVM, let alone across languages)


For this specific case, I think there is a fine argument that regexp_extract_all should be
added simply for consistency with regexp_extract. I can also see the argument that regexp_extract
was itself a step too far, but what has been made public is now a public API.


I'll also say that the cost of adding API functions grows as a project matures, and whereas
it might have made sense to add this at an earlier time, it might not make sense now.


I come out neutral on this specific case, but would not override the opinion of other committers.
But I hope that explains the logic that I think underpins what you're hearing.


On Thu, Jan 28, 2021 at 2:23 PM MrPowers <matthewkevinpowers@gmail.com> wrote:

Thank you all for your amazing work on this project.  Spark has a great
public interface and the source code is clean.  The core team has done a
great job building and maintaining this project.  My emails / GitHub
comments focus on the 1% that we might be able to improve.

Pull requests / suggestions for improvements can come across as negative,
but I'm nothing but happy & positive about this project.  The source code is
delightful to read and the internal abstractions are beautiful.

*API consistency*

The SQL, Scala, and Python APIs are generally consistent.  They all have a
reverse function, for example.

Some of the new PRs have included arguments against consistent rollout of functions
across the APIs.  This seems like a break from the traditional Spark
development process, in which functions were implemented in all APIs (except for
functions that only make sense in certain APIs, like createDataset and
toDS).

The default has shifted from consistent application of functions across APIs
to case-by-case determination.

*Examples*

* The regexp_extract_all function was recently added to the SQL API.  It was
then added to the Scala API, but then removed from the Scala API
(https://github.com/apache/spark/pull/31346).

* There is an ongoing discussion on whether CalendarType will be added to the
Python API (https://github.com/apache/spark/pull/29935).

*Arguments against adding functions like regexp_extract_all to the Scala
API:*

* Some of these functions are SQL specific and don't make sense for the
other languages

* Scala users can access the SQL functions via expr

*Argument rebuttal*

I don't understand the "some of the functions are SQL specific" argument.
regexp_extract_all fills a gap in the API.  Users have been forced to use
UDF workarounds for this in the past, and users of all the APIs need this
functionality.
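For readers unfamiliar with the function, its intended semantics can be sketched in plain Python with the re module.  This is only an illustration of the behavior (and of the kind of UDF workaround mentioned above), not the Spark implementation; the helper name simply mirrors the SQL function:

```python
import re

def regexp_extract_all(s, pattern, idx=0):
    """Return every match of `pattern` in `s`, taking capture group `idx`
    from each match -- the behavior of Spark SQL's regexp_extract_all,
    sketched here with Python's re module."""
    return [m.group(idx) for m in re.finditer(pattern, s)]

# idx=1 picks the first capture group out of each match:
assert regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)", 1) == ["100", "300"]
# idx=0 (the default) returns each whole match:
assert regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)") == ["100-200", "300-400"]
```

regexp_extract, by contrast, returns only the first match, which is exactly the gap being discussed.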

Using expr isn't developer friendly.  Scala and Python users don't want to
manipulate SQL strings.  Nesting functions in SQL strings is complicated, and
the quoting and escaping rules are all different.  Figuring out how to invoke
regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain: you
would need to work out SQL quoting, SQL escaping, and how to reference columns
by name instead of as Column objects.
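To make the escaping pain concrete, here is a small string-only sketch (no Spark is run; the column and pattern come from the email's example, and it assumes Spark SQL's default string-literal escaping, where backslash is an escape character):

```python
# The replacement text the user actually wants is a backslash then a comma.
# As a Python (or Scala) literal, one escape layer suffices:
replacement = "\\,"                    # the two characters: \,
assert replacement == chr(92) + ","

# With the native function, that string is passed as-is:
#   regexp_replace(col("word1"), "//", "\\,")
#
# Embedded in a SQL string for expr(), the backslash must (by default)
# be doubled again, because Spark SQL string literals also treat
# backslash as an escape character -- so the SQL text needs two
# literal backslashes, which is four in the host-language literal:
sql_text = 'regexp_replace(word1, "//", "\\\\,")'
assert sql_text.count(chr(92)) == 2    # two literal backslashes in the SQL text
```

Every extra layer of nesting multiplies this bookkeeping, which is the point being made here.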

Any of the org.apache.spark.sql.functions can be invoked via expr.  The core
reason the Scala/Python APIs exist is so that developers don't need to
manipulate strings for expr.

regexp_extract_all should be added to the Scala API for the same reasons
that regexp_extract was added to the Scala API. 

*Next steps*

* I'd like to better understand why we've broken from the traditional Spark
development process of "consistently implementing functions across all APIs"
to "selectively implementing functions in certain APIs"

* Hopefully we can shift the burden of proof to those in favor of inconsistent
application.  Consistent application should be the default.

Thank you all for your excellent work on this project.

- Matthew Powers (GitHub: MrPowers)



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
