spark-user mailing list archives

From 颜发才(Yan Facai) <yaf...@gmail.com>
Subject Re: Spark joins using row id
Date Sun, 13 Nov 2016 10:51:45 GMT
A pairRDD can use its (hash) partitioning information to do some optimizations when
joined; I am not sure whether a Dataset can.
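To illustrate what that partitioning information buys, here is a minimal plain-Java sketch (not the Spark API — `joinCoPartitioned`, `partitionOf`, and the 4-partition setup are all illustrative assumptions): when both sides have been split with the same hash partitioner, a key's match can only live in the same-numbered partition, so the join can run partition by partition with no cross-partition (shuffle) traffic.

```java
import java.util.*;

public class CoPartitionedJoin {
    static final int NUM_PARTITIONS = 4;

    // Route a key to a partition by its hash, the way a hash partitioner would.
    static int partitionOf(int key) {
        return Math.floorMod(Integer.hashCode(key), NUM_PARTITIONS);
    }

    // Split a key->value map into NUM_PARTITIONS hash partitions.
    static List<Map<Integer, String>> partition(Map<Integer, String> data) {
        List<Map<Integer, String>> parts = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) parts.add(new HashMap<>());
        data.forEach((k, v) -> parts.get(partitionOf(k)).put(k, v));
        return parts;
    }

    // Left outer join of two co-partitioned sides: because both sides used
    // the same partitioner, each partition is joined only against its
    // same-numbered counterpart -- no data crosses partitions.
    static Map<Integer, String> joinCoPartitioned(Map<Integer, String> left,
                                                  Map<Integer, String> right) {
        List<Map<Integer, String>> lp = partition(left);
        List<Map<Integer, String>> rp = partition(right);
        Map<Integer, String> joined = new TreeMap<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            Map<Integer, String> rightPart = rp.get(i);
            for (Map.Entry<Integer, String> e : lp.get(i).entrySet()) {
                String match = rightPart.get(e.getKey()); // null = no right-side row
                joined.put(e.getKey(), e.getValue() + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> ds1 = Map.of(1, "A", 2, "B", 3, "C", 4, "C");
        Map<Integer, String> ds2 = Map.of(1, "X", 2, "Y", 3, "Z");
        System.out.println(joinCoPartitioned(ds1, ds2));
        // {1=A,X, 2=B,Y, 3=C,Z, 4=C,null}
    }
}
```

In Spark terms this is the benefit of `partitionBy` on a pairRDD before joining: a second RDD partitioned the same way can be joined without a shuffle.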

On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma <rohit.verma@rokittech.com>
wrote:

> For datasets structured as
>
> ds1
> rowN col1
> 1       A
> 2       B
> 3       C
> 4       C
> …
>
> and
>
> ds2
> rowN col2
> 1       X
> 2       Y
> 3       Z
> …
>
> I want to do a left join
>
> Dataset<Row> joined = ds1.join(ds2, "rowN", "left_outer");
>
> I read somewhere, on SO or this mailing list, that if Spark is aware of
> the datasets being sorted, it will use some optimizations for joins.
> Is it possible to make this join more efficient/faster?
>
> Rohit
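On the sorted-input question above: when both sides are sorted on the join key, the join degenerates to a single merge pass with two cursors — no hash table and no per-row lookup. Below is a minimal plain-Java sketch of a left outer merge join on rowN (an illustration of the access pattern a sort-merge join exploits, not Spark's actual implementation):

```java
import java.util.*;

public class SortedMergeJoin {
    // Left outer merge join of two sides already sorted by rowN: one linear
    // pass, advancing a cursor on each side. This is why pre-sorted inputs
    // make joins cheap -- neither side needs to be hashed or re-shuffled.
    static List<String> leftOuterMergeJoin(int[] leftKeys, String[] leftVals,
                                           int[] rightKeys, String[] rightVals) {
        List<String> out = new ArrayList<>();
        int j = 0;
        for (int i = 0; i < leftKeys.length; i++) {
            // Advance the right cursor past keys smaller than the current left key.
            while (j < rightKeys.length && rightKeys[j] < leftKeys[i]) j++;
            boolean matched = j < rightKeys.length && rightKeys[j] == leftKeys[i];
            // Left outer: emit the left row even when the right side has no match.
            out.add(leftKeys[i] + "," + leftVals[i] + "," + (matched ? rightVals[j] : null));
        }
        return out;
    }

    public static void main(String[] args) {
        int[] rowN1 = {1, 2, 3, 4};             // ds1 from the question
        String[] col1 = {"A", "B", "C", "C"};
        int[] rowN2 = {1, 2, 3};                // ds2 from the question
        String[] col2 = {"X", "Y", "Z"};
        System.out.println(leftOuterMergeJoin(rowN1, col1, rowN2, col2));
        // [1,A,X, 2,B,Y, 3,C,Z, 4,C,null]
    }
}
```

In Spark itself, sorting alone on a `Dataset` is not guaranteed to survive a shuffle; persisting both sides pre-partitioned and pre-sorted on the join key (e.g. via bucketing) is one way to let the planner pick a shuffle-free sort-merge join.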
