There are several ways I can compute the cosine similarities between a Spark ML vector to each ML vector in a Spark DataFrame column then sorting for the highest results.  However, I can't come up with a method that is faster than replacing the `/data/` in a Spark ML Word2Vec model, then using `.findSynonyms()`.  The problem is the Word2Vec model is held entirely in the driver which can cause memory issues if the data set I want to compare to gets too big.

1. Is there a more efficient method than the ones I have shown below?
2. Could the data for the Word2Vec model be distributed across the cluster?
3. Could the the `.findSynonyms()` [Scala code]( be modified to make a spark sql function that can operate efficiently over a whole Spark DataFrame?

Methods I have tried:

#1 rdd function:

    # vecIn = vector of same dimensions as 'vectors' column
    def cosSim(row, vecIn):
        return (
            tuple(( Vectors.dense( Vectors.dense( /
                        (Vectors.dense(np.sqrt( *
                ).toArray().tolist())) row: cosSim(row, vecIn)).toDF(['CosSim']).show(truncate=False)


#2  `.toIndexedRowMatrix().columnSimilarities()` then filter the results (not shown):

        IndexedRowMatrix( row: (row.vectors.toArray())))


#3 replace Word2Vec model `/data/` with my own, then load 'revised' model and use `.findSynonyms()`:

    ## StructType(List(StructField(word,StringType,true),StructField(vector,ArrayType(FloatType,true),true)))
    df_words_vectors.write.parquet("exiting_Word2Vec_model/data/", mode='overwrite')
    new_Word2Vec_model = Word2VecModel.load("exiting_Word2Vec_model")
    ## vecIn = vector of same dimensions as 'vector' column in DataFrame saved over Word2Vec model /data/
    new_Word2Vec_model.findSynonyms(vecIn, 20).show()




Clay Stevens