spark-dev mailing list archives

From "Ranjan, Abhinav" <abhinav.ranjan...@gmail.com>
Subject override collect_list
Date Wed, 27 Nov 2019 05:57:49 GMT
Hi all,

I want to collect some rows into a list using Spark's collect_list
function.

However, the number of rows going into the list is overflowing memory.
Is there any way to force the rows to be collected onto disk rather
than in memory? Alternatively, instead of collecting them as a single
list, can they be collected as a list of lists, so that a whole group is
never held in memory at once?

Example: df as:

id    col1    col2
1     as      sd
1     df      fg
1     gh      jk
2     rt      ty

df.groupBy("id").agg(collect_list(struct("col1", "col2")).as("col3"))

id    col3
1     [(as,sd),(df,fg),(gh,jk)]
2     [(rt,ty)]


So if id=1 has too many rows, the list will overflow memory. How can I
avoid this scenario?
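To make the "list of lists" idea concrete, here is a rough plain-Python sketch of the logic, not Spark code. It streams over the rows and emits one bounded chunk per (id, chunk number) pair, so at most chunk_size rows per id are buffered at a time. The names collect_in_chunks and chunk_size are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, str, str]  # (id, col1, col2), as in the example df
Chunk = Tuple[Hashable, int, List[Tuple[str, str]]]  # (id, chunk number, rows)


def collect_in_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[Chunk]:
    """Yield (id, chunk_no, rows) with at most chunk_size rows per chunk.

    Hypothetical helper: it models the grouping logic only, not a Spark API.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    buffers = defaultdict(list)    # partially filled chunk for each id
    next_chunk = defaultdict(int)  # next chunk number to emit for each id

    for key, col1, col2 in rows:
        buffers[key].append((col1, col2))
        if len(buffers[key]) == chunk_size:
            # The chunk is full: emit it and start a fresh buffer for this id.
            yield key, next_chunk[key], buffers[key]
            next_chunk[key] += 1
            buffers[key] = []

    # Flush whatever is left over once the input is exhausted.
    for key, leftover in buffers.items():
        if leftover:
            yield key, next_chunk[key], leftover


rows = [(1, "as", "sd"), (1, "df", "fg"), (1, "gh", "jk"), (2, "rt", "ty")]
chunks = list(collect_in_chunks(rows, chunk_size=2))
for chunk in chunks:
    print(chunk)
```

In Spark, something similar could presumably be done by deriving a chunk column from a per-id row number and grouping by both id and that column before calling collect_list. That part is an untested assumption, not a confirmed recipe.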


Thanks,

Abhinav


