spark-dev mailing list archives

From Felix Cheung <>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Mon, 27 May 2019 20:25:11 GMT

I’d prefer to see more of the end goal and how that could be achieved (such as ETL or SPARK-24579).
However given the rounds and months of discussions we have come down to just the public API.

If the community thinks a new set of public APIs is maintainable, I don't see any problem
with that.

From: Tom Graves <>
Sent: Sunday, May 26, 2019 8:22:59 AM
To:; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei Zaharia; Thomas graves;
Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

More feedback would be great. This has been open a long time, though, so let's extend till Wednesday
the 29th and see where we are at.


Sent from Yahoo Mail on Android

On Sat, May 25, 2019 at 6:28 PM, Holden Karau
<> wrote:
Same I meant to catch up after kubecon but had some unexpected travels.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin <<>> wrote:
Can we push this to June 1st? I have been meaning to read it but unfortunately keeps traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <<>> wrote:


On Fri, May 24, 2019 at 17:03 DB Tsai <> wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in an ML proof of concept.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.
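As a rough illustration of the layout difference behind that speedup (toy code only; the classes and arrays below are hypothetical, not Spark's ColumnarBatch API): a contiguous per-column array gives the JIT a tight loop it can auto-vectorize into SIMD instructions, whereas row-at-a-time access interleaves values in memory.

```java
// Toy comparison of row-major vs column-major layouts.
// Hypothetical structures, not Spark's actual columnar API.
public class ColumnarSketch {

    // Row-at-a-time: each row is its own array; values of one
    // column are scattered across the heap.
    static double sumRows(double[][] rows, int col) {
        double s = 0.0;
        for (double[] row : rows) s += row[col];
        return s;
    }

    // Column-at-a-time: one contiguous array per column; this is
    // the loop shape the JIT can turn into SIMD instructions.
    static double sumColumn(double[] column) {
        double s = 0.0;
        for (double v : column) s += v;
        return s;
    }

    public static void main(String[] args) {
        double[][] rows = {{1.0, -1.0}, {2.0, -2.0}, {3.0, -3.0}};
        double[] col0 = {1.0, 2.0, 3.0};
        // Both layouts compute the same result; only the memory
        // access pattern differs.
        System.out.println(sumRows(rows, 0) == sumColumn(col0));
    }
}
```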


DB Tsai
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans <<>> wrote:
> It would allow for the columnar processing to be extended through the shuffle.  So if
I were doing, say, an FPGA-accelerated extension, it could replace the ShuffleExchangeExec with
one that can take a ColumnarBatch as input instead of a Row. The extended version of the ShuffleExchangeExec
could then do the partitioning on the incoming batch, and instead of producing a ShuffleRowRDD
for the exchange it could produce something like a ShuffleBatchRDD that would let the serializing
and deserializing happen in a column-based format for a faster exchange, assuming that columnar
processing is also happening after the exchange. This is just like providing a columnar version
of any other catalyst operator, except in this case it is a bit more complex of an operator.
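A rough sketch of the partitioning step Bobby describes (toy code only; the types and method names here are hypothetical, not Spark's ShuffleExchangeExec): a columnar exchange can decide each row's target partition from the key column alone, then gather every column by those indices, so no row objects are ever materialized.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of hash-partitioning columnar data without converting
// to rows. A real columnar ShuffleExchangeExec replacement would
// then gather each column by these indices into per-partition
// batches and serialize them column by column.
public class ColumnarShuffleSketch {

    // For each partition, collect the row indices that hash into it.
    static List<List<Integer>> partitionIndices(int[] keyColumn, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) parts.add(new ArrayList<>());
        for (int i = 0; i < keyColumn.length; i++) {
            int p = Math.floorMod(Integer.hashCode(keyColumn[i]), numPartitions);
            parts.get(p).add(i);
        }
        return parts;
    }

    public static void main(String[] args) {
        int[] keys = {10, 11, 12, 13};
        // Every input row lands in exactly one partition.
        List<List<Integer>> parts = partitionIndices(keys, 2);
        System.out.println(parts.size());
    }
}
```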
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid <> wrote:
>> Sorry I am late to the discussion here -- the jira mentions using these extensions
for dealing with shuffles, can you explain that part?  I don't see how you would use this
to change shuffle behavior at all.
>>> On Tue, May 14, 2019 at 10:59 AM Thomas graves <<>> wrote:
>>> Thanks for replying. I'll extend the vote till May 26th to allow feedback
>>> from you and other people who haven't had time to look at it.
>>> Tom
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <<>> wrote:
>>> >
>>> > I'd like to ask for this vote period to be extended; I'm interested, but
I don't have the cycles to review it in detail and make an informed vote until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <<>> wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>>> >> 2. List the public APIs we plan to expose in Appendix A. I did a quick check.
Besides ColumnarBatch and ColumnarVector, we also need to make the following public. People
who are familiar with SQL internals should help assess the risk.
>>> >> * ColumnarArray
>>> >> * ColumnarMap
>>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>>> >> * UTF8String
>>> >> * ArrayData
>>> >> * ...
>>> >> 3. I still feel that using Pandas UDF as the mid-term success criterion doesn't match
the purpose of this SPIP. It does make some code cleaner. But I guess for ETL use cases, it
won't bring much value.
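The reason types like ColumnarRow end up on Xiangrui's list can be sketched in toy form (hypothetical classes below, not Spark's real ColumnarBatch/ColumnarRow): a row view over columnar data is just a (batch, rowIndex) pair whose getters read out of the column vectors, so exposing batches to users inevitably exposes the row-view and nested-value types too.

```java
// Toy sketch of a row view over columnar storage. Hypothetical
// classes; Spark's actual ColumnarBatch/ColumnarRow differ.
public class RowViewSketch {

    // A "batch": one contiguous array per column.
    static final class Batch {
        final int[] idCol;
        final double[] scoreCol;
        Batch(int[] id, double[] score) { idCol = id; scoreCol = score; }
    }

    // A "row" is no copy at all, just an index into the batch;
    // getters read straight out of the column vectors.
    static final class RowView {
        final Batch batch;
        final int rowId;
        RowView(Batch b, int i) { batch = b; rowId = i; }
        int getInt(int ordinal) { return batch.idCol[rowId]; }
        double getDouble(int ordinal) { return batch.scoreCol[rowId]; }
    }

    public static void main(String[] args) {
        Batch b = new Batch(new int[]{1, 2}, new double[]{0.5, 0.9});
        RowView r = new RowView(b, 1);
        System.out.println(r.getInt(0) + " " + r.getDouble(1));
    }
}
```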
>>> >>
>>> > --
>>> > Twitter:
>>> > Books (Learning Spark, High Performance Spark, etc.):
>>> > YouTube Live Streams:
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail:<>


