spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: increasing cross join speed
Date Thu, 02 Feb 2017 06:18:04 GMT
Hi,

I'm not sure how to speed up this kind of query on vanilla Spark alone,
but you can write custom physical plans for top-k queries.
See the links below for reference:
benchmark: https://github.com/apache/incubator-hivemall/pull/33
manual:
https://github.com/apache/incubator-hivemall/blob/master/docs/gitbook/spark/misc/topk_join.md
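
To illustrate why a top-k approach helps, here is a minimal plain-Scala sketch of the idea. It is not Hivemall's implementation and not a Spark physical plan; the names TopKSketch, sqdist and topK are hypothetical, and dense Array[Double] vectors are assumed for simplicity. For each query vector it keeps only the k nearest candidates in a bounded heap, so the full list of pairs is never materialized and sorted.

```scala
import scala.collection.mutable

// Hypothetical helper object, for illustration only.
object TopKSketch {
  // Squared Euclidean distance between two dense vectors of equal length.
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Returns the k nearest (name, distance) pairs for one query, nearest first.
  def topK(query: Array[Double],
           candidates: Seq[(String, Array[Double])],
           k: Int): Seq[(String, Double)] = {
    require(k > 0, "k must be positive")
    // Max-heap on distance: the head is the worst candidate kept so far.
    val heap = mutable.PriorityQueue.empty[(String, Double)](
      Ordering.by[(String, Double), Double](_._2))
    for ((name, vec) <- candidates) {
      val d = sqdist(query, vec)
      if (heap.size < k) {
        heap.enqueue((name, d))
      } else if (d < heap.head._2) {
        // The new candidate beats the current worst one, so replace it.
        heap.dequeue()
        heap.enqueue((name, d))
      }
    }
    heap.toSeq.sortBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    val dict = Seq(
      "a" -> Array(0.0, 0.0),
      "b" -> Array(1.0, 1.0),
      "c" -> Array(5.0, 5.0))
    val nearest = topK(Array(0.9, 1.1), dict, 2)
    println(nearest.map(_._1).mkString(", ")) // prints "b, a"
  }
}
```

In Spark, a loop like this could plausibly run per partition against a broadcast copy of the smaller DataFrame, but whether that fits depends on your data sizes, so treat it as an assumption rather than a recommendation.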

I hope this helps.
Thanks,

// maropu


On Wed, Feb 1, 2017 at 6:35 AM, Kürşat Kurt <kursat@kursatkurt.com> wrote:

> Hi;
>
>
>
> I have two DataFrames. I am trying to cross join them to compute vector
> distances, so that I can then choose the most similar vectors.
>
> The cross join is too slow. How can I speed it up, or do you have any
> suggestions for this comparison?
>
> // A join with no condition produces the cross product of both DataFrames.
> val result = myDict.join(mainDataset).map(x => {
>   val orgClassName1 = x.getAs[SparseVector](1)
>   val orgClassName2 = x.getAs[SparseVector](2)
>   val f1 = x.getAs[SparseVector](3)
>   val f2 = x.getAs[SparseVector](4)
>   // Squared Euclidean distance between the two feature vectors.
>   val dist = Vectors.sqdist(f1, f2)
>   (orgClassName1, orgClassName2, dist)
> }).toDF("orgClassName1", "orgClassName2", "dist")
>



-- 
---
Takeshi Yamamuro
