spark-dev mailing list archives

From Matthew Powers <matthewkevinpow...@gmail.com>
Subject Re: [Spark SQL]: SQL, Python, Scala and R API Consistency
Date Fri, 29 Jan 2021 03:03:27 GMT
Thanks for the thoughtful responses.  I now understand why adding all the
functions across all the APIs isn't the default.

To Nick's point, relying on heuristics to gauge user interest, in
addition to personal experience, is a good idea.  The regexp_extract_all SO
thread has 16,000 views
<https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,
so I say we set the threshold to 10k, haha, just kidding!  Like Sean
mentioned, we don't want to add niche functions.  Now we just need a way to
figure out what's niche!

To Reynold's point on overloading Scala functions, I think we should start
trying to limit the number of overloaded functions.  Some functions have
the columnName and column object function signatures.  e.g.
approx_count_distinct(columnName: String, rsd: Double) and
approx_count_distinct(e: Column, rsd: Double).  We can just expose the
approx_count_distinct(e: Column, rsd: Double) variety going forward (not
suggesting any backwards incompatible changes, just saying we don't need
the columnName-type functions for new stuff).

Other functions have one signature with the second argument as a plain
Scala value and another with the second argument as a Column, e.g.
date_add(start: Column, days: Column) and date_add(start: Column, days:
Int).  We can just expose the date_add(start: Column, days: Column) variety
because it's general purpose.  Reynold, let me know if you think that
avoiding Scala function overloading will help.
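
To make the "one Column-typed signature is enough" argument concrete, here
is a toy sketch in Python.  Column, col, and lit below are stand-ins I
define for illustration only; they are not the real Spark classes, and the
functions just build expression strings.  The point is that wrapping a
name with col() or a literal with lit() at the call site covers what the
extra overloads were covering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Stand-in for a Spark Column (hypothetical): holds an expression string."""
    expr: str


def col(name: str) -> Column:
    """Wrap a column name, replacing the columnName: String overloads."""
    return Column(name)


def lit(value) -> Column:
    """Wrap a plain value, replacing the days: Int style overloads."""
    return Column(repr(value))


def date_add(start: Column, days: Column) -> Column:
    """Single signature: both arguments are Columns."""
    return Column(f"date_add({start.expr}, {days.expr})")


def approx_count_distinct(e: Column, rsd: float) -> Column:
    """Single signature: the column is always a Column, never a bare name."""
    return Column(f"approx_count_distinct({e.expr}, {rsd})")


# A literal number of days and a per-row number of days use the same function.
fixed = date_add(col("start"), lit(7))
per_row = date_add(col("start"), col("n_days"))
distinct = approx_count_distinct(col("user_id"), 0.05)
print(fixed.expr, per_row.expr, distinct.expr, sep="\n")
```

The cost is a bit more typing at the call site (col("x") instead of "x"),
in exchange for one documented signature per function.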

Let's brainstorm Nick's idea of creating a framework that'd test Scala /
Python / SQL / R implementations in one fell swoop.  Seems like that'd be a
great way to reduce the maintenance burden.  Reynold's regexp_extract code
from 5 years ago is largely still intact - getting the job done right the
first time is another great way to avoid maintenance!
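
As a starting point for that brainstorm, here is a rough sketch of what a
shared test specification could look like.  Everything in it is
hypothetical: the Case type, the backend adapters, and run_spec are names
I made up, and the two "backends" are plain Python stand-ins rather than
real calls into the SQL or Python APIs.  The function name is borrowed
from this thread, but its behaviour here is simply "return every regex
match", which may not match the real semantics.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Case(NamedTuple):
    function: str      # name of the function under test
    args: tuple        # arguments, in a language-neutral form
    expected: object   # expected result, in a language-neutral form


# The specification is written once and shared by every language.
SPEC: List[Case] = [
    Case("regexp_extract_all", ("a1b22c333", r"\d+"), ["1", "22", "333"]),
    Case("regexp_extract_all", ("no digits", r"\d+"), []),
]


def python_backend(function: str, args: tuple) -> object:
    """Stand-in adapter; a real one would call the Python API."""
    if function == "regexp_extract_all":
        text, pattern = args
        return re.findall(pattern, text)
    raise NotImplementedError(function)


def sql_backend(function: str, args: tuple) -> object:
    """Stand-in adapter; a real one would render and run a SQL query."""
    if function == "regexp_extract_all":
        text, pattern = args
        return [m.group(0) for m in re.finditer(pattern, text)]
    raise NotImplementedError(function)


BACKENDS: Dict[str, Callable[[str, tuple], object]] = {
    "python": python_backend,
    "sql": sql_backend,
}


def run_spec(spec: List[Case], backends: Dict[str, Callable]) -> List[str]:
    """Run every case against every backend and collect mismatches."""
    failures = []
    for case in spec:
        for name, call in backends.items():
            actual = call(case.function, case.args)
            if actual != case.expected:
                failures.append(
                    f"{name}: {case.function}{case.args} returned "
                    f"{actual!r}, expected {case.expected!r}"
                )
    return failures


print(run_spec(SPEC, BACKENDS))
```

With something along these lines, adding a function would mean adding
cases to the spec once, and each language would only need a thin adapter.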

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin <rxin@databricks.com> wrote:

> There's another thing that's not mentioned … it's primarily a problem for
> Scala. Due to static typing, we need a very large number of function
> overloads for the Scala version of each function, whereas in SQL/Python
> they are just one. There's a limit on how many functions we can add, and it
> also makes it difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej <mszymkiewicz@gmail.com> wrote:
>
>> Just my two cents on R side.
>>
>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <srowen@gmail.com> wrote:
>>
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>>
>>> For this specific case, I think there is a fine argument
>>> that regexp_extract_all should be added simply for consistency
>>> with regexp_extract. I can also see the argument that regexp_extract was a
>>> step too far, but, what's public is now a public API.
>>>
>>
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>>
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>>
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>>
>> Python aligns quite well with Scala so that might be fine, but R is a bit
>> trickier. In particular, the lack of proper namespaces makes it rather
>> risky to have packages that export hundreds of functions. sparklyr handles
>> this neatly with NSE, but I don't think we're going to go this way.
>>
>>
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test SQL, Scala, Python, and R functions,
>> without requiring maintainers to write tests for each language's version of
>> the functions. Would that address the maintenance burden?
>>
>> With R we don't really test most of the functions beyond simple
>> "callability". Only the complex ones, which require some non-trivial
>> transformations of arguments, are fully tested.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>>
>
