spark-dev mailing list archives

From Bobby Evans <reva...@gmail.com>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Wed, 15 May 2019 21:58:57 GMT
It would allow columnar processing to be extended through the
shuffle.  So if I were doing, say, an FPGA-accelerated extension, it could
replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
input instead of Rows. The extended version of the ShuffleExchangeExec
could then do the partitioning on the incoming batch and, instead of
producing a ShuffleRowRDD for the exchange, produce something
like a ShuffleBatchRDD that would let the serializing and deserializing
happen in a column-based format for a faster exchange, assuming that
columnar processing is also happening after the exchange. This is just like
providing a columnar version of any other Catalyst operator, except in this
case it is a bit more complex of an operator.
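To make the idea concrete, here is a minimal, self-contained Scala sketch of hash-partitioning a batch column-wise. Note the types here (SimpleBatch, partitionBatch) are simplified stand-ins invented for illustration, not the real ColumnarBatch or ShuffleExchangeExec internals:

```scala
// Simplified stand-in for Spark's ColumnarBatch: a set of equally sized columns.
case class SimpleBatch(columns: Array[Array[Int]]) {
  def numRows: Int = if (columns.isEmpty) 0 else columns(0).length
}

object ColumnarShuffleSketch {
  // Partition a batch by hashing a key column, producing one smaller batch per
  // shuffle partition -- the data never leaves its columnar layout.
  def partitionBatch(batch: SimpleBatch, keyCol: Int, numPartitions: Int): Array[SimpleBatch] = {
    // One pass over the key column to compute each row's target partition.
    val targets = batch.columns(keyCol).map(v => math.floorMod(v, numPartitions))
    Array.tabulate(numPartitions) { p =>
      val rowIds = targets.indices.filter(i => targets(i) == p).toArray
      // Gather the selected rows from every column, staying columnar throughout.
      SimpleBatch(batch.columns.map(col => rowIds.map(i => col(i))))
    }
  }

  def main(args: Array[String]): Unit = {
    val batch = SimpleBatch(Array(Array(1, 2, 3, 4), Array(10, 20, 30, 40)))
    val parts = partitionBatch(batch, keyCol = 0, numPartitions = 2)
    parts.zipWithIndex.foreach { case (b, i) =>
      println(s"partition $i: ${b.columns.map(_.mkString("[", ",", "]")).mkString(" ")}")
    }
  }
}
```

A real ShuffleBatchRDD would additionally serialize each per-partition batch in column order, which is where the speedup over row-at-a-time serialization would come from.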

On Wed, May 15, 2019 at 12:15 PM Imran Rashid <irashid@cloudera.com.invalid>
wrote:

> sorry I am late to the discussion here -- the jira mentions using this
> extensions for dealing with shuffles, can you explain that part?  I don't
> see how you would use this to change shuffle behavior at all.
>
>> On Tue, May 14, 2019 at 10:59 AM Thomas Graves <tgraves@apache.org> wrote:
>
>> Thanks for replying. I'll extend the vote until May 26th to allow
>> feedback from you and other people who haven't had time to look at it.
>>
>> Tom
>>
>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <holden@pigscanfly.ca>
>> wrote:
>> >
>> > I’d like to ask that this vote period be extended. I’m interested, but I
>> don’t have the cycles to review it in detail and make an informed vote
>> until the 25th.
>> >
>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <meng@databricks.com>
>> wrote:
>> >>
>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
>> feel strongly about it. I would still suggest doing the following:
>> >>
>> >> 1. Link the POC mentioned in Q4 so people can verify the POC result.
>> >> 2. List the public APIs we plan to expose in Appendix A. I did a quick
>> check. Besides ColumnarBatch and ColumnarVector, we also need to make the
>> following public. People who are familiar with SQL internals should help
>> assess the risk.
>> >> * ColumnarArray
>> >> * ColumnarMap
>> >> * unsafe.types.CalendarInterval
>> >> * ColumnarRow
>> >> * UTF8String
>> >> * ArrayData
>> >> * ...
>> >> 3. I still feel that using Pandas UDFs as the mid-term success metric
>> doesn't match the purpose of this SPIP. It does make some code cleaner, but
>> I guess for ETL use cases it won't bring much value.
>> >>
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
