spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Usage of PyArrow in Spark
Date Wed, 17 Jul 2019 04:18:59 GMT
Hi,
In Spark 2.3+ I saw that PyArrow is used in a bunch of places in
Spark, and I was trying to understand the benefit it provides in terms
of serialization/deserialization.

I understand that the new pandas_udf works only if PyArrow is installed.
But what about the plain old Python UDF, which can be used in map()-style
operations?
Do they also use PyArrow under the hood to reduce the cost of serde? Or
do they remain as before, so that no performance gain should be expected
there?
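[Editor's aside: the serde contrast being asked about can be sketched without Spark at all. The toy below compares row-at-a-time pickling, the shape of what a plain Python UDF pays per record, against a single batched columnar payload, the shape of what Arrow enables for pandas_udf. It is an illustrative stdlib-only sketch, not PySpark's actual wire format.]

```python
import pickle

# Plain Python UDFs ship rows one at a time through a pickle-based
# serializer; pandas_udf ships whole column batches via Arrow.
# This toy only shows the shape of that difference, not real Arrow behavior.
rows = [{"id": i, "value": i * 0.5} for i in range(1000)]

# Row-at-a-time: one pickle payload per row (plain Python UDF style).
per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Batched/columnar: one payload for the whole batch (Arrow-like style);
# per-record framing and repeated keys are amortized across the batch.
columns = {"id": [r["id"] for r in rows],
           "value": [r["value"] for r in rows]}
batched_bytes = len(pickle.dumps(columns))

print(batched_bytes < per_row_bytes)  # → True: batching shrinks serde overhead
```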

If I'm not mistaken, plain old Python UDFs could also benefit from Arrow:
the cost of serializing/deserializing data between Java and Python and
back still exists and could potentially be reduced by using Arrow.
Is my understanding correct? Are there any plans to implement this?

Pointers to any notes or JIRA tickets about this would be appreciated.
