Actually, good question, I'm not sure. I don't think that Spark would vectorize these operations over rows.Whereas in a pandas UDF, given a DataFrame, you can apply operations like sin to 1000s of values at once in native code via numpy. It's trivially 'vectorizable' and I've seen good wins over, at least, a single-row UDF.On Fri, Apr 9, 2021 at 9:14 AM ayan guha <email@example.com> wrote:Hi Sean - absolutely open to suggestions.My impression was using spark native functions should provide similar perf as scala ones because serialization penalty should not be there, unlike native python udfs.Is it wrong understanding?--On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru <firstname.lastname@example.org> wrote:
From: Sean Owen <email@example.com>
Sent: Friday, April 9, 2021 6:11 PM
To: ayan guha <firstname.lastname@example.org>
Cc: Rao Bandaru <email@example.com>; User <firstname.lastname@example.org>
Subject: Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframeThis can be significantly faster with a pandas UDF, note, because you can vectorize the operations.
On Fri, Apr 9, 2021, 7:32 AM ayan guha <email@example.com> wrote:
We are using a haversine distance function for this, and wrapping it in udf.
from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import *
def haversine_distance(long_x, lat_x, long_y, lat_y):
sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
cos(toRadians(long_x) - toRadians(long_y))
) * lit(6371.0)
distudf = udf(haversine_distance, FloatType())
in case you just want to use just Spark SQL, you can still utilize the functions shown above to implement in SQL.
Any reason you do not want to use UDF?
On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <firstname.lastname@example.org> wrote:
I have a requirement to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe with the help of from geopy import distance without using UDF (user defined function),Please help how to achieve this scenario and do the needful.
Ankamma Rao B