spark-user mailing list archives

From: Jerry Vinokurov <grapesmo...@gmail.com>
Subject: Re: Using Percentile in Spark SQL
Date: Mon, 11 Nov 2019 15:55:54 GMT

I don't think the Spark configuration is what you want to focus on. It's
hard to say without knowing the specifics of the job or the data volume,
but you should be able to accomplish this with the percent_rank function in
SparkSQL and a smart partitioning of the data. If your data has a lot of
skew, you can end up with a situation in which some executors are waiting
around to do work while others are stuck with processing larger partitions,
so you'll need to take a look at the actual stats of your data and figure
out if there's a more efficient partitioning strategy that you can use.
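
To make the percent_rank suggestion concrete, here is a minimal plain-Python
sketch of what the SQL window expression
percent_rank() OVER (PARTITION BY ... ORDER BY ...) computes for each row:
(rank - 1) / (rows in partition - 1), with ties sharing the lowest rank and a
single-row partition getting 0. It only illustrates the semantics and is not
Spark code. The helper names percent_rank and value_at_percentile, the
dict-based rows, and the "smallest value whose percent rank is at least p"
reading of a percentile are all assumptions made for this example, and that
reading will not necessarily match what Hive's percentile or
percentile_approx returns.

```python
from collections import defaultdict


def percent_rank(rows, key, value):
    """Mimic percent_rank() OVER (PARTITION BY key ORDER BY value).

    rows is a list of dicts. Returns (row, percent_rank) pairs, where
    percent_rank = (rank - 1) / (n - 1) inside each partition of size n,
    tied values share the lowest rank, and n == 1 yields 0.0.
    """
    # Group rows by the partitioning key (hypothetical column name).
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)

    result = []
    for part in partitions.values():
        ordered = sorted(part, key=lambda r: r[value])
        n = len(ordered)
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(ordered, start=1):
            if row[value] != previous:
                # A new distinct value takes its 1-based position as rank.
                rank = position
                previous = row[value]
            pr = 0.0 if n == 1 else (rank - 1) / (n - 1)
            result.append((row, pr))
    return result


def value_at_percentile(rows, key, value, p):
    """Per partition, the smallest value whose percent rank is >= p.

    This is one assumed way to read a percentile off percent_rank; it does
    not interpolate between neighbouring values.
    """
    best = {}
    for row, pr in percent_rank(rows, key, value):
        if pr >= p:
            k = row[key]
            if k not in best or row[value] < best[k]:
                best[k] = row[value]
    return best


# Illustrative data only: one partition "a" with five values.
rows = [{"g": "a", "v": v} for v in (10, 20, 30, 40, 50)]
median_per_group = value_at_percentile(rows, "g", "v", 0.5)
```

The grouping key used here is also the natural candidate for the partitioning
column in a real job, so a key that owns a very large share of the rows is
where the skew described above would show up.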

On Mon, Nov 11, 2019 at 10:34 AM Tzahi File <tzahi.file@ironsrc.com> wrote:

> Currently, I'm using the percentile_approx function with Hive.
> I'm looking for a better way to run this function, or another way to get
> the same result with Spark, but faster and without using gigantic instances.
>
> I'm trying to optimize this job by changing the Spark configuration. If
> you have any ideas on how to approach this (instance type, number of
> instances, number of executors, etc.), that would be great.
>
>
> On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy <pmccarthy@dstillery.com>
> wrote:
>
>> Depending on your tolerance for error you could also use
>> percentile_approx().
>>
>> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <grapesmoker@gmail.com>
>> wrote:
>>
>>> Do you mean that you are trying to compute the percent rank of some
>>> data? You can use the Spark SQL percent_rank function for that, but I don't
>>> think that's going to give you any improvement over calling the percent_rank
>>> function through the DataFrame API. Are you currently using a user-defined
>>> function for this task? Because I bet that's what's slowing you down.
>>>
>>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tzahi.file@ironsrc.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
>>>> percentile function. I'm trying to improve this job by moving it to run
>>>> with Spark SQL.
>>>>
>>>> Any suggestions on how to use a percentile function in Spark?
>>>>
>>>>
>>>> Thanks,
>>>> --
>>>> Tzahi File
>>>>
>>>
>>>
>>> --
>>> http://www.google.com/profiles/grapesmoker
>>>
>>
>>
>> --
>>
>>
>> Patrick McCarthy
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
>
> --
> Tzahi File
> Data Engineer
> [image: ironSource] <http://www.ironsrc.com/>
>
> email tzahi.file@ironsrc.com
> mobile +972-546864835
> fax +972-77-5448273
> ironSource HQ - 121 Derech Menachem Begin st. Tel Aviv
> ironsrc.com <http://www.ironsrc.com/>
> [image: linkedin] <https://www.linkedin.com/company/ironsource>[image:
> twitter] <https://twitter.com/ironsource>[image: facebook]
> <https://www.facebook.com/ironSource>[image: googleplus]
> <https://plus.google.com/+ironsrc>
> This email (including any attachments) is for the sole use of the intended
> recipient and may contain confidential information which may be protected
> by legal privilege. If you are not the intended recipient, or the employee
> or agent responsible for delivering it to the intended recipient, you are
> hereby notified that any use, dissemination, distribution or copying of
> this communication and/or its content is strictly prohibited. If you are
> not the intended recipient, please immediately notify us by reply email or
> by telephone, delete this email and destroy any copies. Thank you.
>


-- 
http://www.google.com/profiles/grapesmoker
