spark-dev mailing list archives

From Maciej
Subject Re: [Spark SQL]: SQL, Python, Scala and R API Consistency
Date Thu, 28 Jan 2021 21:09:42 GMT
Just my two cents on the R side.

On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote:
>     It isn't that regexp_extract_all (for example) is useless outside
>     SQL, just, where do you draw the line? Supporting 10s of random
>     SQL functions across 3 other languages has a cost, which has to be
>     weighed against benefit, which we can never measure well except
>     anecdotally: one or two people say "I want this" in a sea of
>     hundreds of thousands of users.
> +1 to this, but I will add that Jira and Stack Overflow activity can
> sometimes give good signals about API gaps that are frustrating users.
> If there is an SO question with 30K views about how to do something
> that should have been easier, then that's an important signal about
> the API.
>     For this specific case, I think there is a fine argument
>     that regexp_extract_all should be added simply for consistency
>     with regexp_extract. I can also see the argument
>     that regexp_extract was a step too far, but, what's public is now
>     a public API.
> I think in this case a few references to where/how people are having
> to work around missing a direct function for regexp_extract_all could
> help guide the decision. But that itself means we are making these
> decisions on a case-by-case basis.
> From a user perspective, it's definitely conceptually simpler to have
> SQL functions be consistent and available across all APIs.
> Perhaps if we had a way to lower the maintenance burden of keeping
> functions in sync across SQL/Scala/Python/R, it would be easier for
> everyone to agree to just have all the functions be included across
> the board all the time.

Python aligns quite well with Scala, so that might be fine, but R is a
bit trickier. In particular, the lack of proper namespaces makes it rather
risky to have packages that export hundreds of functions. sparklyr
handles this neatly with NSE (non-standard evaluation), but I don't think
we're going to go that way.

> Would, for example, some sort of automatic testing mechanism for SQL
> functions help here? Something that uses a common function testing
> specification to automatically test SQL, Scala, Python, and R
> functions, without requiring maintainers to write tests for each
> language's version of the functions. Would that address the
> maintenance burden?

With R we don't really test most of the functions beyond simple
"callability". Only the complex ones, which require some non-trivial
transformations of arguments, are fully tested.
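
To make the "common function testing specification" idea a bit more
concrete, here is a rough sketch in plain Python. Everything in it is
hypothetical: the FunctionCase format, the SPEC list, and the BINDINGS
table are made up for illustration, and the per-language "bindings" are
ordinary Python callables standing in for the real Scala/Python/R
wrappers. Nothing like this exists in Spark today. It only shows how one
shared spec could separate "missing" from "callable and correct".

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FunctionCase:
    """One language-neutral case: function name, arguments, expected result."""
    function: str
    args: Tuple[Any, ...]
    expected: Any


# Hypothetical shared specification, written once for all languages.
SPEC: List[FunctionCase] = [
    FunctionCase("regexp_extract_all",
                 ("100-200, 300-400", r"(\d+)-(\d+)", 1),
                 ["100", "300"]),
    FunctionCase("upper", ("spark",), "SPARK"),
]


def local_regexp_extract_all(s: str, pattern: str, idx: int) -> List[str]:
    # Local stand-in: collect group `idx` from every match of `pattern`.
    return [m.group(idx) for m in re.finditer(pattern, s)]


# Stand-ins for per-language bindings (assumption: a real harness would
# dispatch to each language's API instead of calling Python directly).
BINDINGS: Dict[str, Dict[str, Callable[..., Any]]] = {
    "python": {"regexp_extract_all": local_regexp_extract_all,
               "upper": str.upper},
    # Pretend this binding does not expose regexp_extract_all.
    "r": {"upper": str.upper},
}


def check(spec: List[FunctionCase],
          bindings: Dict[str, Dict[str, Callable[..., Any]]]
          ) -> List[Tuple[str, str, str]]:
    """Run every case against every binding; report ok / mismatch / missing."""
    report = []
    for lang, funcs in sorted(bindings.items()):
        for case in spec:
            impl = funcs.get(case.function)
            if impl is None:
                status = "missing"
            elif impl(*case.args) == case.expected:
                status = "ok"
            else:
                status = "mismatch"
            report.append((lang, case.function, status))
    return report


if __name__ == "__main__":
    for row in check(SPEC, BINDINGS):
        print(row)
```

The "missing" rows are the API gaps this thread is about, and the "ok"
rows go one step beyond the callability checks we have on the R side.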

Best regards,
Maciej Szymkiewicz

