spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Sun, 21 Apr 2019 00:25:32 GMT
Okay, that makes sense, but is the Arrow data format stable? If not, we risk breakage when
Arrow changes in the future and some libraries using this feature begin to use the new
Arrow code.

Matei

> On Apr 20, 2019, at 1:39 PM, Bobby Evans <revans2@gmail.com> wrote:
> 
> I want to be clear that this SPIP is not proposing exposing Arrow APIs/Classes through
any Spark APIs.  SPARK-24579 is doing that, and because of the overlap between the two SPIPs
I scaled this one back to concentrate just on the columnar processing aspects. Sorry for the
confusion; I didn't update the JIRA description clearly enough when we adjusted it during
the discussion on the JIRA.  As part of the columnar processing, we plan on providing Arrow-formatted
data, but that will be exposed through a Spark-owned API.
> 
> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaharia@gmail.com> wrote:
> FYI, I’d also be concerned about exposing the Arrow API or format as a public API if
it’s not yet stable. Is stabilization of the API and format coming soon on the roadmap there?
Maybe someone can work with the Arrow community to make that happen.
> 
> We’ve been bitten lots of times by API changes forced by external libraries even when
those were widely popular. For example, we used Guava’s Optional for a while, which changed
at some point, and we also had issues with Protobuf and Scala itself (especially how Scala’s
APIs appear in Java). API breakage might not be as serious in dynamic languages like Python,
where you can often keep compatibility with old behaviors, but it really hurts in Java and
Scala.
> 
> The problem is especially bad for us because of two aspects of how Spark is used:
> 
> 1) Spark is used for production data transformation jobs that people need to keep running
for a long time. Nobody wants to make changes to a job that’s been working fine and computing
something correctly for years just to get a bug fix from the latest Spark release or whatever.
It’s much better if they can upgrade Spark without editing every job.
> 
> 2) Spark is often used as “glue” to combine data processing code in other libraries,
and these might start to require different versions of our dependencies. For example, the
Guava class exposed in Spark became a problem when third-party libraries started requiring
a new version of Guava: those new libraries just couldn’t work with Spark. Protobuf was
especially bad because some users wanted to read data stored as Protobufs (or in a format
that uses Protobuf inside), so they needed a different version of the library in their main
data processing code.
> 
> If there was some guarantee that this stuff would remain backward-compatible, we’d
be in a much better place. It’s not that hard to keep a storage format backward-compatible:
just document the format and extend it only in ways that don’t break the meaning of old
data (for example, add new version numbers or field types that are read in a different way).
It’s a bit harder for a Java API, but maybe Spark could just expose byte arrays directly
and work on those if the API is not guaranteed to stay stable (that is, we’d still use our
own classes to manipulate the data internally, and end users could use the Arrow library if
they want it).
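> 
> As a rough sketch of that byte-array idea, the contract becomes the documented IPC stream format rather than any library's classes. pyarrow appears here only to show the consumer side; none of these names are Spark APIs:
> 
>     import pyarrow as pa
> 
>     # Producer side: serialize a record batch into the Arrow IPC stream
>     # format, so the API boundary is just bytes.
>     def to_ipc_bytes(batch):
>         sink = pa.BufferOutputStream()
>         writer = pa.RecordBatchStreamWriter(sink, batch.schema)
>         writer.write_batch(batch)
>         writer.close()
>         return sink.getvalue().to_pybytes()
> 
>     # Consumer side: anyone with an Arrow library can decode the bytes;
>     # the producer never exposes Arrow classes in its own API.
>     def from_ipc_bytes(data):
>         reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
>         return reader.read_all()  # a pyarrow.Table
> 
>     batch = pa.RecordBatch.from_arrays([pa.array([1.0, 2.0, 3.0])], names=["x"])
>     assert from_ipc_bytes(to_ipc_bytes(batch)).num_rows == 3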
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans <revans2@gmail.com> wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your comments in
the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <mengxr@gmail.com> wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0 release
someday.
> > 2. ML/DL systems that can benefit from a columnar format are mostly in Python.
> > 3. Simple operations, though they benefit from vectorization, might not be worth the data
exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, SPARK-26412
(a UDF that takes an iterator of Arrow batches).
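> > 
> > As a sketch of the shape SPARK-26412 is aiming at (the decorator and iterator type hints below are illustrative, not a released API), per-task setup would happen once while Arrow batches stream through:
> > 
> >     from typing import Iterator
> >     import pandas as pd
> >     from pyspark.sql import SparkSession
> >     from pyspark.sql.functions import col, pandas_udf
> > 
> >     @pandas_udf("double")
> >     def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
> >         # Expensive setup (e.g. loading a model) would go here, running
> >         # once per task instead of once per batch.
> >         for batch in batches:
> >             yield batch + 1
> > 
> >     spark = SparkSession.builder.getOrCreate()
> >     df = spark.range(4).withColumn("value", col("id").cast("double"))
> >     df.select(plus_one("value")).show()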
> > 
> > Sorry that I didn't join the discussion earlier! Hope it is not too late :)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM <tcondie@gmail.com> wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> > From: Jules Damji <dmatrix@comcast.net> 
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler <cutlerb@gmail.com>
> > Cc: Dev <dev@spark.apache.org>
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing
Support
> > 
> > +1 (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb typos :)
> > 
> > 
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutlerb@gmail.com> wrote:
> > 
> > +1 (non-binding)
> > 
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jlowe@apache.org> wrote:
> > 
> > +1 (non-binding).  Looking forward to seeing better support for processing columnar
data.
> > 
> > Jason
> > 
> > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves_cs@yahoo.com.invalid>
wrote:
> > 
> > Hi everyone,
> > 
> > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar
Processing Support.  The proposal is to extend Spark's public APIs to better support columnar processing.
> > 
> > You can find the full proposal in the JIRA at https://issues.apache.org/jira/browse/SPARK-27396.
There was also a DISCUSS thread on the dev mailing list.
> > 
> > Please vote as early as you can; I will leave the vote open until next Monday (the
22nd) at 2pm CST to give people plenty of time.
> > 
> > [ ] +1: Accept the proposal as an official SPIP
> > 
> > [ ] +0
> > 
> > [ ] -1: I don't think this is a good idea because ...
> > 
> > Thanks!
> > 
> > Tom Graves
> > 
> 

