spark-user mailing list archives

From Rao Bandaru <rao.m...@outlook.com>
Subject Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk dataframe
Date Fri, 09 Apr 2021 12:55:23 GMT
Hi All,

Yes, I need to add the scenario-based code below to the running Spark job. When I executed
it, it took a lot of time to complete. Please suggest the best way to meet the requirement
below without using a UDF.

Thanks,
Ankamma Rao B
________________________________
From: Sean Owen <srowen@gmail.com>
Sent: Friday, April 9, 2021 6:11 PM
To: ayan guha <guha.ayan@gmail.com>
Cc: Rao Bandaru <rao.msbi@outlook.com>; User <user@spark.apache.org>
Subject: Re: [Spark SQL]:to calculate distance between four coordinates(Latitude1, Longtitude1,
Latitude2, Longtitude2) in the pysaprk dataframe

Note that this can be significantly faster with a pandas UDF, because you can vectorize the operations.
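
A minimal sketch of what such a vectorized version could look like, assuming pandas, NumPy and PyArrow are available on the cluster. The names `great_circle_km`, `as_pandas_udf` and `EARTH_RADIUS_KM` are hypothetical and do not come from this thread, and the formula used here is the standard haversine form rather than the exact expression quoted below.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as in the thread


def great_circle_km(lat1: pd.Series, lon1: pd.Series,
                    lat2: pd.Series, lon2: pd.Series) -> pd.Series:
    """Haversine distance in km; each argument is a whole batch of degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def as_pandas_udf():
    """Wrap the batch function as a Series-to-Series pandas UDF."""
    # Imported lazily so the numeric function can be checked without Spark.
    from pyspark.sql.functions import pandas_udf
    return pandas_udf(great_circle_km, returnType="double")
```

With a live session, `as_pandas_udf()` returns an object that can be applied to four numeric columns in `withColumn` or `select`; the arithmetic then runs once per Arrow batch instead of once per row.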

On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.ayan@gmail.com> wrote:
Hi

We are using a haversine distance function for this, and wrapping it in udf.

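from pyspark.sql.functions import acos, cos, sin, lit, radians

def haversine_distance(long_x, lat_x, long_y, lat_y):
    # Great-circle distance in km (spherical law of cosines form).
    # Arguments are Column expressions holding degrees.
    return acos(
        sin(radians(lat_x)) * sin(radians(lat_y)) +
        cos(radians(lat_x)) * cos(radians(lat_y)) *
            cos(radians(long_x) - radians(long_y))
    ) * lit(6371.0)  # mean Earth radius in km

# The body is built from Spark Column functions, so call haversine_distance(...)
# directly on DataFrame columns. Passing it to udf() would hand plain Python
# floats to these Column functions and fail at run time.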

In case you want to use just Spark SQL, you can still use the functions shown above
to implement this in SQL.
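
As a rough sketch of that SQL route, the same expression can be assembled as a string from the Spark SQL built-ins `acos`, `sin`, `cos` and `radians`. The helper name `great_circle_sql` and the column names `Latitude1`, `Longitude1`, `Latitude2`, `Longitude2` are assumptions for illustration only.

```python
def great_circle_sql(lat1: str, lon1: str, lat2: str, lon2: str,
                     radius_km: float = 6371.0) -> str:
    """Build a Spark SQL expression for the spherical-law-of-cosines distance."""
    return (
        f"{radius_km} * acos("
        f"sin(radians({lat1})) * sin(radians({lat2})) + "
        f"cos(radians({lat1})) * cos(radians({lat2})) * "
        f"cos(radians({lon1}) - radians({lon2})))"
    )


# Assumed column names; adjust to the actual DataFrame schema.
expr_sql = great_circle_sql("Latitude1", "Longitude1", "Latitude2", "Longitude2")

# With a live SparkSession this could be used as, for example:
#   df.selectExpr("*", f"{expr_sql} AS distance_km")
```

Because the whole expression is evaluated by Spark itself, no Python worker is involved. One caveat: for identical or nearly identical points, floating-point rounding can push the `acos` argument slightly above 1 and produce NaN, so it may be worth guarding that case.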

Any reason you do not want to use UDF?

Credit: https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark

On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.msbi@outlook.com> wrote:
Hi All,



I have a requirement to calculate the distance between four coordinates (Latitude1, Longitude1,
Latitude2, Longitude2) in a PySpark dataframe with the help of "from geopy import distance",
without using a UDF (user-defined function). Please help me work out how to achieve this.



Thanks,

Ankamma Rao B


--
Best Regards,
Ayan Guha
