spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Sat, 25 May 2019 20:56:01 GMT
Can we push this to June 1st? I have been meaning to read it but
unfortunately keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <dongjoon.hyun@gmail.com>
wrote:

> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai <dbtsai@dbtsai.com.invalid> wrote:
>
>> +1 on exposing the APIs for columnar processing support.
>>
>> I understand that the scope of this SPIP doesn't cover AI / ML
>> use-cases. But I saw a good performance gain when I converted data
>> from rows to columns to leverage on SIMD architectures in a POC ML
>> application.
>>
>> With the exposed columnar processing support, I can imagine that the
>> heavy lifting parts of ML applications (such as computing the
>> objective functions) can be written as columnar expressions that
>> leverage on SIMD architectures to get a good speedup.
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Wed, May 15, 2019 at 2:59 PM Bobby Evans <revans2@gmail.com> wrote:
>> >
>> > It would allow the columnar processing to be extended through the
>> shuffle.  So if I were doing, say, an FPGA-accelerated extension, it could
>> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
>> input instead of a Row. The extended version of the ShuffleExchangeExec
>> could then do the partitioning on the incoming batch and, instead of
>> producing a ShuffledRowRDD for the exchange, it could produce something
>> like a ShuffleBatchRDD that would let the serializing and deserializing
>> happen in a column-based format for a faster exchange, assuming that
>> columnar processing is also happening after the exchange. This is just like
>> providing a columnar version of any other Catalyst operator, except that in
>> this case the operator is a bit more complex.
>> >
>> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid
>> <irashid@cloudera.com.invalid> wrote:
>> >>
>> >> Sorry I am late to the discussion here -- the JIRA mentions using these
>> extensions for dealing with shuffles; can you explain that part?  I don't
>> see how you would use this to change shuffle behavior at all.
>> >>
>> >> On Tue, May 14, 2019 at 10:59 AM Thomas Graves <tgraves@apache.org>
>> wrote:
>> >>>
>> >>> Thanks for replying, I'll extend the vote until May 26th to allow
>> >>> feedback from you and other people who haven't had time to look at it.
>> >>>
>> >>> Tom
>> >>>
>> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <holden@pigscanfly.ca>
>> wrote:
>> >>> >
>> >>> > I’d like to ask for this vote period to be extended. I’m interested, but
>> I don’t have the cycles to review it in detail and make an informed vote
>> until the 25th.
>> >>> >
>> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <meng@databricks.com>
>> wrote:
>> >>> >>
>> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I
>> don't feel strongly about it. I would still suggest doing the following:
>> >>> >>
>> >>> >> 1. Link the POC mentioned in Q4 so people can verify the POC
>> result.
>> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
>> check. Besides ColumnarBatch and ColumnVector, we also need to make the
>> following public. People who are familiar with SQL internals should help
>> assess the risk.
>> >>> >> * ColumnarArray
>> >>> >> * ColumnarMap
>> >>> >> * unsafe.types.CalendarInterval
>> >>> >> * ColumnarRow
>> >>> >> * UTF8String
>> >>> >> * ArrayData
>> >>> >> * ...
>> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't
>> match the purpose of this SPIP. It does make some code cleaner. But I guess
>> for ETL use cases, it won't bring much value.
>> >>> >>
>> >>> > --
>> >>> > Twitter: https://twitter.com/holdenkarau
>> >>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>>
>>
>>
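
The columnar shuffle that Bobby Evans describes in the thread can be
illustrated with a minimal sketch. It is plain Python rather than Spark code:
the dict-of-lists Batch type and the partition_batch function are hypothetical
stand-ins, not Spark's ColumnarBatch or ShuffleExchangeExec. It only shows the
idea of partitioning an incoming batch column by column, so that every output
partition is itself a batch and no per-row objects are built.

```python
from typing import Dict, List

# Hypothetical stand-in for a columnar batch: column name -> list of values.
# This is an assumption for illustration, not Spark's ColumnarBatch class.
Batch = Dict[str, List]


def partition_batch(batch: Batch, key: str, num_partitions: int) -> List[Batch]:
    """Split one columnar batch into one columnar batch per partition."""
    # Decide the target partition of every row from the key column alone.
    targets = [hash(value) % num_partitions for value in batch[key]]
    # One empty output batch per partition, with the same columns as the input.
    out: List[Batch] = [{name: [] for name in batch} for _ in range(num_partitions)]
    # Scatter each column independently; no row objects are ever created.
    for name, values in batch.items():
        for target, value in zip(targets, values):
            out[target][name].append(value)
    return out


batch = {"id": [1, 2, 3, 4], "score": [0.5, 0.1, 0.9, 0.3]}
parts = partition_batch(batch, "id", 2)
```

With integer keys, parts[0] holds the rows with even ids and parts[1] the rows
with odd ids, each still laid out as columns. A real extension would also need
a column-based serializer on both sides of the exchange, which this sketch
leaves out.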
