spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe
Date Sat, 10 Apr 2021 01:51:16 GMT
Hi - interesting stuff. My stance has always been to prefer Spark native
functions, then pandas, then native Python - in that order.

To OP - did you try the code? What kind of perf are you seeing? Just
curious, why do you think UDFs are bad?
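Sean's vectorization point below can be sketched outside Spark with plain
numpy/pandas (a minimal sketch, not production code; the argument order and
the 6371 km mean Earth radius are my assumptions):

```python
import numpy as np
import pandas as pd

def haversine_km(long_x, lat_x, long_y, lat_y):
    # Vectorized great-circle distance: each argument is a pandas Series
    # (or numpy array), so numpy evaluates sin/cos over the whole batch
    # in native code instead of making one Python call per row.
    long_x, lat_x, long_y, lat_y = map(np.radians, (long_x, lat_x, long_y, lat_y))
    a = (np.sin((lat_y - lat_x) / 2.0) ** 2 +
         np.cos(lat_x) * np.cos(lat_y) * np.sin((long_y - long_x) / 2.0) ** 2)
    return 6371.0 * 2.0 * np.arcsin(np.sqrt(a))

# Equator to the north pole is a quarter of a great circle.
d = haversine_km(pd.Series([0.0]), pd.Series([0.0]),
                 pd.Series([0.0]), pd.Series([90.0]))
```

The same function can be handed to pyspark's pandas_udf (e.g.
`pandas_udf(haversine_km, "double")`) so Spark feeds it Arrow batches of
rows, which is where the win over a single-row Python UDF comes from.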

On Sat, 10 Apr 2021 at 2:36 am, Sean Owen <srowen@gmail.com> wrote:

> Actually, good question, I'm not sure. I don't think that Spark would
> vectorize these operations over rows.
> Whereas in a pandas UDF, given a DataFrame, you can apply operations like
> sin to 1000s of values at once in native code via numpy. It's trivially
> 'vectorizable' and I've seen good wins over, at least, a single-row UDF.
>
> On Fri, Apr 9, 2021 at 9:14 AM ayan guha <guha.ayan@gmail.com> wrote:
>
>> Hi Sean - absolutely open to suggestions.
>>
>> My impression was that Spark native functions should deliver
>> performance similar to Scala ones, because the serialization penalty
>> of native Python UDFs should not apply.
>>
>> Is that understanding wrong?
>>
>>
>>
>> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru <rao.msbi@outlook.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>> Yes, I need to add code for the scenario below to the Spark job that
>>> is executing. It takes a long time to complete; please suggest the
>>> best way to meet the requirement below without using a UDF.
>>>
>>>
>>> Thanks,
>>>
>>> Ankamma Rao B
>>> ------------------------------
>>> *From:* Sean Owen <srowen@gmail.com>
>>> *Sent:* Friday, April 9, 2021 6:11 PM
>>> *To:* ayan guha <guha.ayan@gmail.com>
>>> *Cc:* Rao Bandaru <rao.msbi@outlook.com>; User <user@spark.apache.org>
>>> *Subject:* Re: [Spark SQL]: to calculate distance between four
>>> coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark
>>> dataframe
>>>
>>> Note that this can be significantly faster with a pandas UDF, because
>>> you can vectorize the operations.
>>>
>>> On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> We are using a haversine distance function for this, built entirely
>>> from Spark column functions, so it can be applied to columns directly.
>>> (No udf wrapper is needed; in fact, wrapping this particular function
>>> in udf() would fail at runtime, because the udf machinery passes plain
>>> Python floats into column functions such as sin.)
>>>
>>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians
>>>
>>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>>>     # Great-circle distance in km; each argument is a Column or a
>>>     # column name.
>>>     return acos(
>>>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>>>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>>>             cos(toRadians(long_x) - toRadians(long_y))
>>>     ) * lit(6371.0)
>>>
>>> df = df.withColumn("dist_km",
>>>     haversine_distance("long_x", "lat_x", "long_y", "lat_y"))
>>>
>>> In case you want to use just Spark SQL, you can still use the same
>>> functions shown above to implement the expression in SQL.
>>>
>>> Any reason you do not want to use UDF?
>>>
>>> Credit
>>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>>
>>>
>>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.msbi@outlook.com>
>>> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I have a requirement to calculate the distance between two points given
>>> four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in a
>>> *pyspark dataframe*, with the help of *distance* from *geopy*, without
>>> using a *UDF* (user defined function). Please suggest how to achieve
>>> this scenario.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Ankamma Rao B
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
> --
Best Regards,
Ayan Guha
