spark-dev mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Sat, 25 May 2019 22:28:14 GMT
Same here. I meant to catch up after KubeCon but had some unexpected travel.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin <rxin@databricks.com> wrote:

> Can we push this to June 1st? I have been meaning to read it but
> unfortunately keep traveling...
>
> On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> +1
>>
>> Thanks,
>> Dongjoon.
>>
>> On Fri, May 24, 2019 at 17:03 DB Tsai <dbtsai@dbtsai.com.invalid> wrote:
>>
>>> +1 on exposing the APIs for columnar processing support.
>>>
>>> I understand that the scope of this SPIP doesn't cover AI / ML
>>> use-cases. But I saw a good performance gain when I converted data
>>> from rows to columns to leverage SIMD architectures in a POC ML
>>> application.
>>>
>>> With the exposed columnar processing support, I can imagine that the
>>> heavy lifting parts of ML applications (such as computing the
>>> objective functions) can be written as columnar expressions that
>>> leverage SIMD architectures to get a good speedup.
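
A minimal sketch of the row-to-column conversion described above, assuming a hypothetical one-feature model (label ~ weight * feature) and made-up sample data. None of the names below are Spark APIs. Plain Python does not itself use SIMD; the sketch only shows the layout change that lets an objective function scan contiguous per-column arrays instead of individual row objects.

```python
from array import array

# Hypothetical row-oriented records: (feature, label) pairs.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]


def rows_to_columns(rows):
    """Transpose rows into one contiguous typed array per column."""
    features = array("d", (r[0] for r in rows))
    labels = array("d", (r[1] for r in rows))
    return features, labels


def squared_error(features, labels, weight):
    """Sum of squared residuals for the assumed model label ~ weight * feature."""
    return sum((weight * x - y) ** 2 for x, y in zip(features, labels))


features, labels = rows_to_columns(rows)
loss = squared_error(features, labels, 2.0)
```

With a vectorized runtime, the loop inside `squared_error` is the part that could be expressed as a single columnar expression over the two arrays.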
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 42E5B25A8F7A82C1
>>>
>>> On Wed, May 15, 2019 at 2:59 PM Bobby Evans <revans2@gmail.com> wrote:
>>> >
>>> > It would allow for the columnar processing to be extended through the
>>> shuffle.  So if I were doing say an FPGA accelerated extension it could
>>> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
>>> input instead of a Row. The extended version of the ShuffleExchangeExec
>>> could then do the partitioning on the incoming batch and instead of
>>> producing a ShuffledRowRDD for the exchange they could produce something
>>> like a ShuffleBatchRDD that would let the serializing and deserializing
>>> happen in a column based format for a faster exchange, assuming that
>>> columnar processing is also happening after the exchange. This is just like
>>> providing a columnar version of any other catalyst operator, except in this
>>> case it is a bit more complex of an operator.
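
A rough sketch of the idea above: partitioning an incoming columnar batch so that each shuffle partition is itself columnar and no per-row objects are created. This is not Spark code. `partition_batch` is a hypothetical helper, the batch is modeled as two parallel arrays, and modulo on an integer key stands in for whatever partitioner a real exchange operator would use.

```python
from array import array


def partition_batch(keys, values, num_partitions):
    """Split one columnar batch into per-partition columnar batches."""
    # One (key column, value column) pair per output partition.
    parts = [(array("q"), array("d")) for _ in range(num_partitions)]
    for k, v in zip(keys, values):
        # Modulo partitioning is an assumption made for this illustration.
        part_keys, part_values = parts[k % num_partitions]
        part_keys.append(k)
        part_values.append(v)
    return parts


keys = array("q", [1, 2, 3, 4, 5])
values = array("d", [10.0, 20.0, 30.0, 40.0, 50.0])
parts = partition_batch(keys, values, 2)
```

Each entry of `parts` keeps its columns as separate arrays, so a downstream columnar operator could consume it without a column-to-row conversion.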
>>> >
>>> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid
>>> <irashid@cloudera.com.invalid> wrote:
>>> >>
>>> >> Sorry I am late to the discussion here -- the JIRA mentions using
>>> these extensions for dealing with shuffles, can you explain that part?  I
>>> don't see how you would use this to change shuffle behavior at all.
>>> >>
>>> >> On Tue, May 14, 2019 at 10:59 AM Thomas Graves <tgraves@apache.org>
>>> wrote:
>>> >>>
>>> >>> Thanks for replying, I'll extend the vote until May 26th to allow
>>> >>> feedback from you and others who haven't had time to look at it.
>>> >>>
>>> >>> Tom
>>> >>>
>>> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>> >>> >
>>> >>> > I’d like to ask that this vote period be extended. I’m interested
>>> but I don’t have the cycles to review it in detail and make an informed
>>> vote until the 25th.
>>> >>> >
>>> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <meng@databricks.com>
>>> wrote:
>>> >>> >>
>>> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I
>>> don't feel strongly about it. I would still suggest doing the following:
>>> >>> >>
>>> >>> >> 1. Link the POC mentioned in Q4 so people can verify the POC
>>> result.
>>> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a
>>> quick check. Besides ColumnarBatch and ColumnVector, we also need to make
>>> the following public. People who are familiar with SQL internals should
>>> help assess the risk.
>>> >>> >> * ColumnarArray
>>> >>> >> * ColumnarMap
>>> >>> >> * unsafe.types.CalendarInterval
>>> >>> >> * ColumnarRow
>>> >>> >> * UTF8String
>>> >>> >> * ArrayData
>>> >>> >> * ...
>>> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't
>>> match the purpose of this SPIP. It does make some code cleaner. But I guess
>>> for ETL use cases, it won't bring much value.
>>> >>> >>
>>> >>> > --
>>> >>> > Twitter: https://twitter.com/holdenkarau
>>> >>> > Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>>
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
