spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: [Spark SQL]: SQL, Python, Scala and R API Consistency
Date Thu, 28 Jan 2021 21:00:06 GMT
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <srowen@gmail.com> wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

+1 to this, but I will add that Jira and Stack Overflow activity can
sometimes give good signals about API gaps that are frustrating users. If
there is an SO question with 30K views about how to do something that
should have been easier, then that's an important signal about the API.

> For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument that regexp_extract was a
> step too far, but, what's public is now a public API.
>

I think in this case a few references to where/how people are having to
work around missing a direct function for regexp_extract_all could help
guide the decision. But that itself means we are making these decisions on
a case-by-case basis.

From a user perspective, it's definitely conceptually simpler to have SQL
functions be consistent and available across all APIs.

Perhaps if we had a way to lower the maintenance burden of keeping
functions in sync across SQL/Scala/Python/R, it would be easier for
everyone to agree to just have all the functions be included across the
board all the time.

Would, for example, some sort of automatic testing mechanism for SQL
functions help here? Something that uses a common function testing
specification to automatically test SQL, Scala, Python, and R functions,
without requiring maintainers to write tests for each language's version of
the functions. Would that address the maintenance burden?
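
To make the idea a bit more concrete, here is a rough sketch of what a
shared spec could look like. Everything in it (FunctionSpec, sql_expr,
column_expr, run_spec) is a hypothetical name made up for illustration,
not an existing Spark facility. It only handles int and plain string
arguments, and it assumes each language exposes the function under the
same name and accepts lit(...) arguments, which would not hold for every
function.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FunctionSpec:
    """One language-neutral test case for a SQL function (hypothetical format)."""
    name: str
    args: Tuple[Any, ...]
    expected: Any


def _check_simple(value: Any) -> None:
    # Keep the sketch honest: quoting and escaping rules differ per language,
    # so strings containing quotes or backslashes are rejected outright.
    if isinstance(value, str) and any(ch in value for ch in "'\"\\"):
        raise ValueError(f"unsupported characters in literal: {value!r}")


def sql_expr(spec: FunctionSpec) -> str:
    """Render the spec as a SQL query string."""
    parts = []
    for arg in spec.args:
        _check_simple(arg)
        parts.append(f"'{arg}'" if isinstance(arg, str) else str(arg))
    return f"SELECT {spec.name}({', '.join(parts)})"


def column_expr(spec: FunctionSpec) -> str:
    """Render the spec as a name(lit(...), ...) call, a form that is
    syntactically valid in Scala, Python, and R for simple literals."""
    parts = []
    for arg in spec.args:
        _check_simple(arg)
        parts.append(f"lit({json.dumps(arg)})")
    return f"{spec.name}({', '.join(parts)})"


def render(spec: FunctionSpec) -> Dict[str, str]:
    """Render one spec into an expression string for every API."""
    call = column_expr(spec)
    return {"sql": sql_expr(spec), "scala": call, "python": call, "r": call}


def run_spec(spec: FunctionSpec,
             evaluators: Dict[str, Callable[[str], Any]]) -> List[str]:
    """Evaluate the rendered expression with each supplied evaluator and
    return the APIs whose result differs from spec.expected."""
    rendered = render(spec)
    return [api for api, evaluate in evaluators.items()
            if evaluate(rendered[api]) != spec.expected]


# Example spec; the expected value follows the documented behavior of the
# SQL function regexp_extract_all with group index 1.
SPEC = FunctionSpec(
    name="regexp_extract_all",
    args=("100-200, 300-400", "([0-9]+)-([0-9]+)", 1),
    expected=["100", "300"],
)
```

The evaluators passed to run_spec are where a real harness would plug in
a Spark session per language (for example, running the SQL string and
collecting the single result). That part is deliberately left out here,
since how it would be wired up is exactly the open question.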
