spark-dev mailing list archives

From Matthew Powers <matthewkevinpow...@gmail.com>
Subject Re: [Spark SQL]: SQL, Python, Scala and R API Consistency
Date Sat, 30 Jan 2021 15:45:30 GMT
Maciej - I like the idea of a separate library to provide easy access to
functions that the maintainers don't want to merge into Spark core.

I've seen this model work well in other open source communities.  The Rails
Active Support library provides the Ruby community with core functionality
like beginning_of_month.  The Ruby community has a good, well-supported
function, but it's not in the Ruby codebase so it's not a maintenance
burden - best of both worlds.

I'll start a proof-of-concept repo.  If the repo gets popular, I'll be
happy to donate it to a GitHub organization like Awesome Spark
<https://github.com/awesome-spark> or the ASF.
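
To give a sense of how thin such a binding could be, here is a rough sketch
in Python.  It assumes the Spark version in use already implements the
regexp_extract_all SQL expression, and that the column is a plain top-level
column name.  The helper names are hypothetical, not part of any existing
library.

```python
def regexp_extract_all_sql(column: str, pattern: str, idx: int = 1) -> str:
    """Build the Spark SQL text for regexp_extract_all(column, pattern, idx).

    Hypothetical helper: assumes `column` is a plain top-level column name.
    """
    # Backtick-quote the column name, doubling any embedded backticks.
    quoted_column = "`" + column.replace("`", "``") + "`"
    # With Spark's default parser settings, backslashes and single quotes
    # inside a SQL string literal need to be backslash-escaped.
    escaped_pattern = pattern.replace("\\", "\\\\").replace("'", "\\'")
    return f"regexp_extract_all({quoted_column}, '{escaped_pattern}', {idx})"


def regexp_extract_all(column: str, pattern: str, idx: int = 1):
    """Return a PySpark Column wrapping the SQL expression.

    pyspark is imported lazily so the string builder above can be used
    (and tested) without a Spark installation.
    """
    from pyspark.sql import functions as F

    return F.expr(regexp_extract_all_sql(column, pattern, idx))
```

Usage would be something like df.select(regexp_extract_all("text", r"(\d+)")).
Keeping the SQL string builder separate from the Column wrapper means the
escaping logic can be unit tested without starting a Spark session.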

On Sat, Jan 30, 2021 at 9:35 AM Maciej <mszymkiewicz@gmail.com> wrote:

> Just thinking out loud ‒ if there is a community need for providing language
> bindings for less popular SQL functions, could these live outside the main
> project or even outside the ASF?  As long as expressions are already
> implemented, bindings are trivial after all.
>
> It could also allow usage of a more scalable hierarchy (let's say with
> modules / packages per function family).
>
> On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
>
> FYI exposing methods with Column signature only is already documented at
> the top of functions.scala, and I believe that has been the current dev
> direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used expressions.
> It's best if it considers language-specific context. Many of the expressions
> are for SQL compliance. Many data science Python libraries don't support such
> features, as an example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers, <matthewkevinpowers@gmail.com>
> wrote:
>
>> Thanks for the thoughtful responses.  I now understand why adding all the
>> functions across all the APIs isn't the default.
>>
>> To Nick's point, relying on heuristics to gauge user interest, in
>> addition to personal experience, is a good idea.  The regexp_extract_all
>> SO thread has 16,000 views
>> <https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,
>> so I say we set the threshold to 10k, haha, just kidding!  Like Sean
>> mentioned, we don't want to add niche functions.  Now we just need a way to
>> figure out what's niche!
>>
>> To Reynold's point on overloading Scala functions, I think we should start
>> trying to limit the number of overloaded functions.  Some functions have
>> the columnName and column object function signatures.  e.g.
>> approx_count_distinct(columnName: String, rsd: Double) and
>> approx_count_distinct(e: Column, rsd: Double).  We can just expose the
>> approx_count_distinct(e: Column, rsd: Double) variety going forward (not
>> suggesting any backwards incompatible changes, just saying we don't need
>> the columnName-type functions for new stuff).
>>
>> Other functions have one signature with the second object as a Scala
>> object and another signature with the second object as a column object,
>> e.g. date_add(start: Column, days: Column) and date_add(start: Column,
>> days: Int).  We can just expose the date_add(start: Column, days: Column)
>> variety cause it's general purpose.  Let me know if you think that avoiding
>> Scala function overloading will help Reynold.
>>
>> Let's brainstorm Nick's idea of creating a framework that'd test Scala /
>> Python / SQL / R implementations in one fell swoop.  Seems like that'd be a
>> great way to reduce the maintenance burden.  Reynold's regexp_extract code
>> from 5 years ago is largely still intact - getting the job done right the
>> first time is another great way to avoid maintenance!
>>
>> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin <rxin@databricks.com> wrote:
>>
>>> There's another thing that's not mentioned … it's primarily a problem
>>> for Scala. Due to static typing, we need a very large number of function
>>> overloads for the Scala version of each function, whereas in SQL/Python
>>> they are just one. There's a limit on how many functions we can add, and it
>>> also makes it difficult to browse through the docs when there are a lot of
>>> functions.
>>>
>>>
>>>
>>> On Thu, Jan 28, 2021 at 1:09 PM, Maciej <mszymkiewicz@gmail.com> wrote:
>>>
>>>> Just my two cents on R side.
>>>>
>>>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>>>
>>>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>>>> across 3 other languages has a cost, which has to be weighed against
>>>>> benefit, which we can never measure well except anecdotally: one or two
>>>>> people say "I want this" in a sea of hundreds of thousands of users.
>>>>>
>>>>
>>>> +1 to this, but I will add that Jira and Stack Overflow activity can
>>>> sometimes give good signals about API gaps that are frustrating users. If
>>>> there is an SO question with 30K views about how to do something that
>>>> should have been easier, then that's an important signal about the API.
>>>>
>>>>> For this specific case, I think there is a fine argument
>>>>> that regexp_extract_all should be added simply for consistency
>>>>> with regexp_extract. I can also see the argument that regexp_extract
>>>>> was a step too far, but, what's public is now a public API.
>>>>>
>>>>
>>>> I think in this case a few references to where/how people are having to
>>>> work around missing a direct function for regexp_extract_all could help
>>>> guide the decision. But that itself means we are making these decisions on
>>>> a case-by-case basis.
>>>>
>>>> From a user perspective, it's definitely conceptually simpler to have
>>>> SQL functions be consistent and available across all APIs.
>>>>
>>>> Perhaps if we had a way to lower the maintenance burden of keeping
>>>> functions in sync across SQL/Scala/Python/R, it would be easier for
>>>> everyone to agree to just have all the functions be included across the
>>>> board all the time.
>>>>
>>>> Python aligns quite well with Scala so that might be fine, but R is a
>>>> bit of a tricky thing. Especially the lack of proper namespaces makes it
>>>> rather risky to have packages that export hundreds of functions. sparklyr
>>>> handles this neatly with NSE, but I don't think we're going to go this way.
>>>>
>>>>
>>>> Would, for example, some sort of automatic testing mechanism for SQL
>>>> functions help here? Something that uses a common function testing
>>>> specification to automatically test SQL, Scala, Python, and R functions,
>>>> without requiring maintainers to write tests for each language's version
>>>> of the functions. Would that address the maintenance burden?
>>>>
>>>> With R we don't really test most of the functions beyond simple
>>>> "callability". Only the complex ones, which require some non-trivial
>>>> transformations of arguments, are fully tested.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>>
>>>> Web: https://zero323.net
>>>> Keybase: https://keybase.io/zero323
>>>> Gigs: https://www.codementor.io/@zero323
>>>> PGP: A30CEF0C31A501EC
>>>>
>>>>
>>>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
