spark-issues mailing list archives

From "Luca Canali (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-30153) Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization
Date Fri, 06 Dec 2019 16:12:00 GMT
Luca Canali created SPARK-30153:
-----------------------------------

             Summary: Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization
                 Key: SPARK-30153
                 URL: https://issues.apache.org/jira/browse/SPARK-30153
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Luca Canali


Spark introduced vectorized UDFs with pandas_udf, which provide a considerable speedup
by reducing serialization and deserialization overhead, where applicable.
The current implementation of pandas_udf uses Arrow for fast serialization and then Pandas
Series (or Pandas DataFrames) for processing.
In certain cases, UDF performance can be improved further by bypassing the conversion
to and from Pandas and working on Arrow Tables directly, with the help of specialized libraries
able to process Arrow Tables and Arrays.
One such case is scientific computing on high energy physics data, where processing
arrays of data is of key importance.
A test case using this approach showed a performance increase of about 3x compared
to the equivalent processing with pandas_udf. The test UDF was based on plain Arrow
serialization, using a custom-developed extension of pandas_udf, and the Arrow data
was processed with the "awkward arrays" library (https://github.com/scikit-hep/awkward-array).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

