spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ayan guha <guha.a...@gmail.com>
Subject Re: How to get recent value in spark dataframe
Date Mon, 19 Dec 2016 09:36:21 GMT
You have 2 parts to it

1. Do a sub query where for each primary key derive latest value of flag=1
records. Ensure you get exactly 1 record per primary key value. Here you
can use rank() over (partition by primary key order by year desc)

2. Join your original dataset with the above on primary key. If year is
higher than latest flag=1 record then take it else mark it null

If you can have primary keys which may have no flag=1 records then they
wouldn't show up in set 1 above. So if you still want them in result then
adjust 1 accordingly.

Best
Ayan
On Mon., 19 Dec. 2016 at 1:01 pm, Richard Xin
<richardxin168@yahoo.com.invalid> wrote:

> I am not sure I understood your logic, but it seems to me that you could
> take a look of Hive's Lead/Lag functions.
>
>
> On Monday, December 19, 2016 1:41 AM, Milin korath <
> milin.korath@impelsys.com> wrote:
>
>
> thanks, I tried with left outer join. My dataset having around 400M
> records and lot of shuffling is happening.Is there any other workaround
> apart from Join,I tried use window function but I am not getting a proper
> solution,
>
>
> Thanks
>
> On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust <michael@databricks.com>
> wrote:
>
> Oh and to get the null for missing years, you'd need to do an outer join
> with a table containing all of the years you are interested in.
>
> On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
> Are you looking for argmax? Here is an example
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html>
> .
>
> On Wed, Dec 14, 2016 at 8:49 PM, Milin korath <milin.korath@impelsys.com>
> wrote:
>
> Hi
>
> I have a spark data frame with following structure
>
>  id  flag price date
>
>   a   0    100  2015
>
>   a   0    50   2015
>
>   a   1    200  2014
>
>   a   1    300  2013
>
>   a   0    400  2012
>
> I need to create a data frame with recent value of flag 1 and updated in
> the flag 0 rows.
>
>       id  flag price date new_column
>
>       a   0    100  2015    200
>
>       a   0    50   2015    200
>
>       a   1    200  2014    null
>
>       a   1    300  2013    null
>
>       a   0    400  2012    null
>
> We have 2 rows having flag=0. Consider the first row(flag=0),I will have 2
> values(200 and 300) and I am taking the recent one 200(2014). And the last
> row I don't have any recent value for flag 1 so it is updated with null.
> Looking for a solution using scala. Any help would be appreciated.Thanks
>
> Thanks
> Milin
>
>
>
>
>
>
>
>
>
>
>
>

Mime
View raw message