I've been trying to achieve the same objective, and came up with approaches similar to your methods 1 and 2. Method 2 is the slowest for me due to the massive amount of data being shuffled around at each matrix operation stage. Method 3 is new to me, so I can't comment much.
I ended up using an approach that is similar to your method 1, which gives reasonable performance in my use case.

#4 Normalizer then UDF (PySpark code)
```
from pyspark.ml.feature import Normalizer
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

normaliser = Normalizer(inputCol="vec", outputCol="norm_vec")
df_word_norm = normaliser.transform(df_word)

dot_udf = F.udf(lambda x, y: float(x.dot(y)), DoubleType())
df_score = df_word_norm.withColumn("score", dot_udf(df_word_norm.norm_vec1, df_word_norm.norm_vec2))
# norm_vec1 and norm_vec2 come from a Cartesian join. Steps to produce them are not shown for brevity.
```
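In case it helps to see why the Normalizer step pays off: with p=2 (the default) it scales each vector to unit length, so the dot product of two normalised vectors *is* their cosine similarity, and the UDF never has to recompute norms per pair. A small NumPy sketch of that equivalence, outside Spark and with hypothetical vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Cosine similarity computed directly from the raw vectors.
cos_direct = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Normalise first (what Normalizer with p=2 does), then take a plain dot product.
cos_via_norm = (x / np.linalg.norm(x)).dot(y / np.linalg.norm(y))

# The two agree up to floating-point error.
```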

Would = be curious to learn how other people solve this problem.

Best wishes,
Chee Yee

On Tue, 24 Sep 2019 at 04:20, Stevens, Clay <Clay.Stevens@wolterskluwer.com> wrote:

There are several ways I can compute the cosine similarities between a Spark ML vector and each ML vector in a Spark DataFrame column, then sort for the highest results.  However, I can't come up with a method that is faster than replacing the `/data/` in a Spark ML Word2Vec model, then using `.findSynonyms()`.  The problem is that the Word2Vec model is held entirely in the driver, which can cause memory issues if the data set I want to compare to gets too big.

1. Is there a more efficient method than the ones I have shown below?
2. Could the data for the Word2Vec model be distributed across the cluster?
3. Could the `.findSynonyms()` [Scala code](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L571-L619) be modified to make a Spark SQL function that can operate efficiently over a whole Spark DataFrame?

Methods I have tried:

#1 rdd function:
```
    import numpy as np
    from pyspark.ml.linalg import Vectors

    # vecIn = vector of same dimensions as 'vectors' column
    def cosSim(row, vecIn):
        return (
            tuple((Vectors.dense(Vectors.dense(row.vectors.dot(vecIn)) /
                                 (Vectors.dense(np.sqrt(row.vectors.dot(row.vectors))) *
                                  Vectors.dense(np.sqrt(vecIn.dot(vecIn)))))
                   ).toArray().tolist()))

    df.rdd.map(lambda row: cosSim(row, vecIn)).toDF(['CosSim']).show(truncate=False)
```
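[Aside: the per-row arithmetic in method #1 reduces to one matrix-vector product over all rows at once, which is also the easiest way to check the expected values. A NumPy sketch with hypothetical 2-dimensional data, not the Spark code itself:]

```python
import numpy as np

# Hypothetical stand-in for the 'vectors' column: one row per word vector.
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [3.0, 4.0]])
vecIn = np.array([1.0, 1.0])

# Cosine similarity of every row against vecIn in one shot:
# dot products divided by the product of the row norms and |vecIn|.
cos_sim = (M @ vecIn) / (np.linalg.norm(M, axis=1) * np.linalg.norm(vecIn))
```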

#2 `.toIndexedRowMatrix().columnSimilarities()` then filter the results (not shown):
```
    spark.createDataFrame(
        IndexedRowMatrix(df.rdd.map(lambda row: (row.vectors.toArray())))
        .toBlockMatrix()
        .transpose()
        .toIndexedRowMatrix()
        .columnSimilarities()
        .entries)
```
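[Aside: `.columnSimilarities()` returns the upper-triangular cosine similarities between the *columns* of the matrix, which is why method #2 needs the transpose dance. A brute-force NumPy equivalent of that output on a hypothetical small matrix, ignoring DIMSUM's sampling:]

```python
import numpy as np

# Hypothetical matrix whose columns we compare; after the transpose in
# method #2, each original row vector has become a column.
A = np.array([[1.0, 0.0, 3.0],
              [1.0, 1.0, 4.0]])

# Normalise columns; the Gram matrix then holds pairwise cosine similarities.
col_norms = np.linalg.norm(A, axis=0)
G = (A / col_norms).T @ (A / col_norms)

# columnSimilarities() reports only entries above the diagonal (i < j).
entries = [(i, j, G[i, j]) for i in range(A.shape[1])
           for j in range(i + 1, A.shape[1])]
```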

#3 replace Word2Vec model `/data/` with my own, then load 'revised' model and use `.findSynonyms()`:
```
    df_words_vectors.schema
    ## StructType(List(StructField(word,StringType,true),StructField(vector,ArrayType(FloatType,true),true)))

    df_words_vectors.write.parquet("exiting_Word2Vec_model/data/", mode='overwrite')

    new_Word2Vec_model = Word2VecModel.load("exiting_Word2Vec_model")

    ## vecIn = vector of same dimensions as 'vector' column in DataFrame saved over Word2Vec model /data/
    new_Word2Vec_model.findSynonyms(vecIn, 20).show()
```
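[Aside: conceptually, `findSynonyms` is a cosine-similarity scan over the model's vector table followed by a top-k sort, which is why it is fast but bound to driver memory. A NumPy sketch of that logic with hypothetical words and vectors, not the actual Spark implementation:]

```python
import numpy as np

# Hypothetical word-vector table standing in for the model's /data/.
words = ["cat", "dog", "car"]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0]])

def find_synonyms(vec_in, num):
    # Cosine similarity of vec_in against every stored vector, then top-k.
    sims = (vecs @ vec_in) / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(vec_in))
    top = np.argsort(-sims)[:num]
    return [(words[i], float(sims[i])) for i in top]

synonyms = find_synonyms(np.array([1.0, 0.0]), 2)
# A query identical to "cat"'s vector should rank "cat" first.
```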


Clay Stevens
