spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tzahi File <tzahi.f...@ironsrc.com>
Subject Re: Using Percentile in Spark SQL
Date Mon, 11 Nov 2019 15:33:49 GMT
Currently, I'm using the percentile approx function with Hive.
I'm looking for a better way to run this function or another way to get the
same result with spark, but faster and not using gigantic instances..

I'm trying to optimize this job by changing the Spark configuration. If you
have any ideas how to approach this, it would be great (like instance type,
number of instances, number of executers etc.)


On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy <pmccarthy@dstillery.com>
wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <grapesmoker@gmail.com>
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tzahi.file@ironsrc.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>> [image: ironSource] <http://www.ironsrc.com/>
>>>
>>> email tzahi.file@ironsrc.com
>>> mobile +972-546864835
>>> fax +972-77-5448273
>>> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
>>> ironsrc.com <http://www.ironsrc.com/>
>>> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
>>> twitter] <https://twitter.com/ironsource>[image: facebook]
>>> <https://www.facebook.com/ironSource>[image: googleplus]
>>> <https://plus.google.com/+ironsrc>
>>> This email (including any attachments) is for the sole use of the
>>> intended recipient and may contain confidential information which may be
>>> protected by legal privilege. If you are not the intended recipient, or the
>>> employee or agent responsible for delivering it to the intended recipient,
>>> you are hereby notified that any use, dissemination, distribution or
>>> copying of this communication and/or its content is strictly prohibited. If
>>> you are not the intended recipient, please immediately notify us by reply
>>> email or by telephone, delete this email and destroy any copies. Thank you.
>>>
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


-- 
Tzahi File
Data Engineer
[image: ironSource] <http://www.ironsrc.com/>

email tzahi.file@ironsrc.com
mobile +972-546864835
fax +972-77-5448273
ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
ironsrc.com <http://www.ironsrc.com/>
[image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
twitter] <https://twitter.com/ironsource>[image: facebook]
<https://www.facebook.com/ironSource>[image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.

Mime
View raw message